Abstract

One of the difficulties of object recognition stems from the need to overcome the variability in object
appearance caused by factors such as illumination and pose. The influence of these factors can be countered by
learning to interpolate between stored views of the target object, taken under representative combinations
of viewing conditions. Difficulties of another kind arise in daily life situations that require categorization,
rather than recognition, of objects. We show that, although categorization cannot rely on interpolation
between stored examples, knowledge of several representative members, or prototypes, of each of the
categories of interest can still provide the necessary computational substrate for the categorization of new
instances. The resulting representational scheme based on similarities to prototypes is computationally
viable, and is readily mapped onto the mechanisms of biological vision revealed by recent psychophysical
and physiological studies.
1 Introduction

To be able to recognize objects, a visual system must combine the capacity for
internal representation and for the storage of object traces with the ability to
compare these against the incoming visual stimuli, namely, images of objects. The
appearance of an object is determined not only by its shape and surface properties,
but also by its disposition with respect to the observer and the illumination
sources, by the optical properties of the intervening medium and the imaging
system, and by the presence and location of other objects in the scene (Ullman,
1996). Thus, to detect that two images belong, in fact, to the same
three-dimensional object, the visual system must overcome the influence of a number
of factors that affect the way objects look.

The choice of approach to the separation of the intrinsic shape of an object from
the extrinsic factors affecting its appearance depends on the nature of the task
faced by the system. One of these tasks, which may be properly called recognition
(knowing a previously seen object as such), appears now to require little more than
storing information concerning earlier encounters with the object, as suggested by
the success of view-based recognition algorithms developed in computer vision in
the early 1990s (Poggio and Edelman, 1990; Ullman and Basri, 1991; Breuel, 1992;
Tomasi and Kanade, 1992). In this paper, we show that it is surprisingly easy to
extend such a memory-based strategy to deal with categorization, a task that
requires the system to make sense of novel shapes. Thus, familiarity with a
relatively small selection of objects can be used as a foundation for processing
(i.e., representing and categorizing) other objects, never seen before.

The theory of representation on which the present approach is based calls for
describing objects in terms of their similarities to a relatively small number of
reference shapes (Edelman, 1995b; Edelman, 1997b). The theoretical underpinnings of
this idea are discussed elsewhere (Edelman and Duvdevani-Bar, 1997); here, we
demonstrate its viability on a variety of objects and object classes, and discuss
the implications of its successful implementation for understanding object
representation and categorization in biological vision.

1.1 Visual recognition 

If the appearance of visual objects were immutable and unaffected by any extrinsic
factors, recognition would amount to simple comparison by template matching, a
technique in which two patterns are regarded as the same if they can be brought
into one-to-one correspondence. As things stand, the effects of the extrinsic
factors must be mitigated to ensure that the comparison is valid. Theories of
recognition, therefore, tend to have two parts: one concentrating on the form of
the internal representation into which images of objects are cast, and the other on
the details of the comparison process.

A model of recognition that is particularly well-suited to the constraints imposed
by a biological implementation has been described by Poggio and Edelman (1990).
This model relies on the observation that the views of a rigid object undergoing a
transformation such as rotation in depth reside in a smooth low-dimensional
manifold embedded in the space of coordinates of points attached to the object
(Ullman and Basri, 1991; Jacobs, 1996); furthermore, the properties of smoothness
and low dimensionality of this view space manifold are likely to be preserved in
whatever measurement space is used by the front-end of the visual system. The
operational consequence of this observation is that a new view of an object may be
recognized by interpolation among its selected stored views, which together
represent the object. A criterion that indicates the quality of the interpolation
can be formed by comparing the stimulus view to the stored views, by passing the
ensuing proximity values through a Gaussian nonlinearity, and by computing a
weighted sum of the results (this amounts to a basis-function interpolation of the
view manifold, as described in section 3.1). The outcome of this computation is an
estimate of the measurement-space distance between the point that encodes the
stimulus and the view manifold. If a sufficient number of views is available to
define that manifold, this distance can be made arbitrarily independent of the pose
of the object, one of the extrinsic factors that affect the appearance of object
views. The influence of the other extrinsic factors (e.g., illumination) can be
minimized in a similar manner, by storing examples that span the additional
dimensions of the view manifold, corresponding to the additional degrees of freedom
of the process of image formation.

In the recognition scenario, the tacit assumption is that the stimulus image is
either totally unfamiliar, or, in fact, corresponds to one of the objects known to
the system. A sensible generic decision strategy under this assumption is
nearest-neighbor (Cover and Hart, 1967), which assigns to the stimulus the label of
the object that matches it optimally (modulo the influence of the extrinsic
factors, and, possibly, measurement noise). In the view-interpolation scheme, the
decision can be based on the value of the distance-to-the-manifold criterion that
reflects the quality of the interpolation (a low value signifies an unfamiliar
object). As we argue next, this approach, being an instance of the generic
nearest-neighbor strategy, addresses only a small part of the problem of visual
object processing.

1.2 Visual categorization 

Because it assumes that variability in object appearance is mainly due to factors
such as illumination and pose, the standard approach to recognition calls for a
comparison between the intrinsic shape of the viewed object (separated from the
influence of the extrinsic factors) and the stored representation of that shape.
According to this view, a good representation is one that makes explicit the
intrinsic shape of an object in great detail and with high fidelity.

A reflection on the nature of everyday recognition tasks prompts one to question
the validity of this view of representation. In a normal categorization situation
(Rosch, 1978; Smith, 1990), human observers are expected to ignore much of the
shape detail (Price and Humphreys, 1989). Barring special (albeit behaviorally
important) cases such as face recognition, entry-level (Jolicoeur et al., 1984)
names of objects correspond to categories rather than to individuals, and it is the
category of the object that the visual system is required to determine. Thus, the
observer is confronted with potential variation in the intrinsic shape of an
object, because objects called by the same name do not, generally, have exactly the
same shape. This variability in the shape (and not merely in the appearance) of
objects must be adequately represented, so that it can be treated properly at the
categorization stage.

Figure 1: The process of image formation. A family of shapes (say, 4-legged animal-like objects) can be defined
parametrically, using a small number of variables (Edelman and Duvdevani-Bar, 1997), illustrated symbolically on
the left by the three “sliders” that control the values of the shape variables. These, in turn, determine the geometry
of the object, e.g., the locations of the vertices of a triangular mesh that approximates the object’s shape. Finally,
intrinsic and extrinsic factors (geometry and viewing conditions) together determine the appearance of the object.

Different gradations of shape variation call for different kinds of action on the
part of the visual system. On the one hand, moderately novel objects can be handled
by the same mechanism that processes familiar ones, insofar as such objects
constitute variations on familiar themes. Specifically, the nearest-neighbor
strategy around which the generic recognition mechanism is built can be allowed to
handle shape variation that does not create ambiguous situations in which two
categories vie for the ownership of the current stimulus. On the other hand, if the
stimulus image belongs to a radically novel object (e.g., one that is nearly
equidistant, in the similarity space defined by the representational system, to two
or more familiar objects, or very distant from any such object), a nearest-neighbor
decision no longer makes sense, and should be abandoned in favor of a better
procedure. Such a procedure, suitable for representing both familiar and novel
shapes, is described in the next section.

2 The shape space 

To be able to treat familiar and novel shapes uniformly within the same
representational framework, it is useful to describe shapes as points in a common
parameter space. A common parameterization is especially straightforward for shapes
that are sampled at a preset resolution, then defined by the coordinates of the
sample points (cf. Figure 1). For instance, a family of shapes each of which is a
“cloud” of k points spans a 3k-dimensional shape space (Kendall, 1984); moving the
k points around in 3D (or, equivalently, moving around the single point in the
3k-dimensional shape space) amounts to changing one shape into another.
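
The correspondence between a cloud of k points and a single point in a 3k-dimensional space can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the toy shapes, and the use of plain Euclidean distance (with no normalization for position, orientation, or size) are assumptions made for illustration only.

```python
import math

def shape_to_point(cloud):
    """Flatten a cloud of k (x, y, z) points into one 3k-dimensional vector."""
    return [coord for point in cloud for coord in point]

def shape_distance(cloud_a, cloud_b):
    """Euclidean distance between two shapes in the 3k-dimensional space.

    Assumes both shapes are sampled with the same k points in the same order.
    """
    a, b = shape_to_point(cloud_a), shape_to_point(cloud_b)
    if len(a) != len(b):
        raise ValueError("shapes must be sampled with the same number of points")
    return math.dist(a, b)

# Two hypothetical 4-point shapes; the second moves one vertex by 1 along z.
tetra = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
stretched = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 2.0)]

point = shape_to_point(tetra)               # k = 4, so 12 coordinates
distance = shape_distance(tetra, stretched)
```

Moving a single vertex of the cloud moves the corresponding point in the 12-dimensional space by the same amount, which is the sense in which deforming a shape amounts to moving a point.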

By defining similarity between shapes via a distance function in the shape space,
clusters of points are made to correspond to classes of shapes (i.e., sets of
shapes whose members are more similar to each other than to members of other sets).
To categorize a (possibly novel) shape, then, one must first find the corresponding
point in the shape space, then determine its location with respect to the familiar
shape clusters. Note that while a novel shape may fall in between the clusters, it
will in any case possess a well-defined representation. This representation may
then be acted upon, e.g., by committing it to memory, or by using it as a seed for
establishing a new cluster.

2.1 The high-dimensional measurement space 

Obviously, a visual system has no direct access to whatever shape space the
geometry of distal objects may be defined in (in fact, the notion of a unique
geometrical shape space does not even make sense: the same physical object can be
described quantitatively in many different ways). The useful and intuitive notion
of a space in which each point corresponds to some shape can, however, be put to
work by introducing an intermediary concept: measurement space.

A system that carries out a large number of measurements on a visual stimulus
effectively maps that stimulus into a point in a high-dimensional space; the
diversity and the large number of independent measurements increase the likelihood
that any change in the geometry of the distal objects ends up represented at least
in some of the dimensions of the measurement space. Indeed, in primate vision, the
dimensionality of the space presented by the eye to the brain is roughly one
million, the same as the number of fibers in each optic nerve.

Most of this high-dimensional space is empty: a randomly chosen combination of
pixel values in an image is extremely unlikely to form a picture of a coherent
object. The locus of the measurement-space points that do represent images of
coherent objects depends on all the factors that participate in image formation,
both intrinsic (the shapes of objects) and extrinsic (e.g., their pose), which
together define the proximal shape space. Note that smoothly changing the shape of
the imaged object causes the corresponding point to trace out a manifold in the
measurement space. The dimensionality of this manifold depends on the number of
degrees of freedom of the shape changes; for example, simple morphing of one shape
into another produces a one-dimensional manifold (a curve). Likewise, rotating the
object in depth (a transformation with two degrees of freedom) gives rise to a
two-dimensional manifold which we call the view space of the object. It turns out
that the proximal shape space, produced by the joint effects of deformation and
transformation, can be safely considered a locally smooth low-dimensional manifold
embedded in the measurement space (Edelman and Duvdevani-Bar, 1997).

2.2 Dimensionality reduction and the proximal shape space

In the above formulation, the categorization problem becomes equivalent to
determining the location of the measurement-space representation of the stimulus
within the proximal shape space. Our approach to this problem is inspired by the
observation that the location of a point can be precisely defined by specifying its
distance to some prominent reference points, or landmarks (Edelman and
Duvdevani-Bar, 1997). Because distance here is meant to capture difference in shape
(i.e., the amount of deformation), its estimation must exclude (1) components of
measurement-space distance that are orthogonal to the shape space, as well as (2)
components of shape transformation such as rotation. As we shall see, a convenient
computational mechanism for distance estimation that satisfies these two
requirements is a module tuned to a particular shape, that is, designed to respond
selectively to that shape, irrespective of its transformation. A few such modules,
tuned to different reference shapes, effectively reduce the dimensionality of the
representation from that of the measurement space to a small number, equal to the
number of modules (Figure 2). In the next section, we describe a system for shape
categorization based on a particular implementation of this approach, which we call
the Chorus of Prototypes (Edelman, 1995b); its relevance as a model of shape
processing in biological vision is discussed in section 5.
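
The idea of locating a point by its distances to a few landmarks can be illustrated with a minimal Python sketch. It is only a caricature of the scheme described here: it uses raw Euclidean distance in a hypothetical 6-dimensional measurement space, so it satisfies neither of the two requirements listed above (the tuned modules of section 3 are what address those). The names and numbers are illustrative assumptions.

```python
import math

def landmark_code(x, landmarks):
    """Describe point x by its distances to a few reference points."""
    return [math.dist(x, ref) for ref in landmarks]

# Hypothetical 6-dimensional measurement vectors for three reference shapes.
landmarks = [
    (1.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0, 0.0, 0.0),
]
stimulus = (0.9, 0.1, 0.0, 0.0, 0.0, 0.0)

# Three numbers now stand in for the six original measurements.
code = landmark_code(stimulus, landmarks)
nearest = min(range(len(code)), key=lambda i: code[i])
```

The length of the code equals the number of landmarks, not the dimensionality of the measurement space, which is the dimensionality reduction referred to in the text.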

3 The implementation 

A module tuned to a particular shape will fulfill the first of the two requirements
stated above (ignoring the irrelevant components of the measurement-space distance)
if it is trained to discriminate among objects all of which belong to the desired
shape space. Such training imparts to the module the knowledge of the relevant
measurement-space directions, by making it concentrate on the features that help
discriminate between the objects. To fulfill the second requirement (insensitivity
to shape transformations), the module must be trained to respond equally to
different views of the object to which it is tuned. A trainable computational
mechanism capable of meeting these two requirements is a radial basis function
(RBF) interpolation module.


3.1 The RBF module 

When stated in terms of an input-output relationship, our goal is to build a module
that would output a nonzero constant for any view of a certain target object, and
zero for any view of all the other objects in the training set. Because only a few
target views are usually available for training, the problem is to interpolate the
view space of the target object, given some examples of its members. With basis
function interpolation (Broomhead and Lowe, 1988), this problem can be solved by a
distributed network, whose structure can be learned from examples (Poggio and
Girosi, 1990).

According to this method, the interpolating function is constructed out of a
superposition of basis functions, whose shape reflects the prior knowledge
concerning the change in the output as one moves away from the data point. In the
absence of evidence to the contrary, all directions of movement are considered
equivalent, making it reasonable to assume that the basis function is radial (that
is, it depends only on the distance between the actual input and the original data
point, which serves as its center). The resulting scheme is known as radial basis
function (RBF) interpolation. Once the basis functions have been placed, the output
of the interpolation module for any test point is computed by taking a weighted sum
of the values of all the basis functions at that point.

An application of RBF interpolation to object recognition has been described by
Poggio and Edelman (1990); the RBF model was subsequently used to replicate a
number of central characteristics of the process of recognition in human vision
(Bülthoff and Edelman, 1992). In its simple version, one basis function is used for
(the measurement-space representation of) each familiar view. The appropriate
weight for each basis is then computed by an algorithm that involves matrix
inversion (a closed-form solution exists for this case). This completes the process
of training the RBF network. To determine whether a test view belongs to the object
on which the network has been trained, this view (that is, its measurement-space
representation) is compared to each of the training views. This step yields a set
of distances between the test view and the training views that serve as the centers
of the basis functions. In the next step, the values of the basis functions are
combined linearly to determine the output of the network (see Figure 3, inset, and
appendix A).
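
The simple version of the module (one Gaussian basis function per stored view, weights obtained in closed form) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the two-dimensional "views", the target values, and the fixed spread `sigma` are made up, and a small Gaussian-elimination routine stands in for the matrix inversion mentioned in the text.

```python
import math

def gaussian(a, b, sigma):
    """Gaussian basis function of the distance between vectors a and b."""
    return math.exp(-math.dist(a, b) ** 2 / sigma ** 2)

def solve(matrix, rhs):
    """Solve matrix * w = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - s) / aug[i][i]
    return w

def train_rbf(views, targets, sigma):
    """One basis function per stored view; weights reproduce the targets."""
    gram = [[gaussian(u, v, sigma) for v in views] for u in views]
    return solve(gram, targets)

def rbf_output(x, views, weights, sigma):
    """Weighted sum of the basis-function values at the test point x."""
    return sum(w * gaussian(x, v, sigma) for w, v in zip(weights, views))

# Toy data: two views of the target (output 1), two of another object (output 0).
views = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0), (5.0, 4.0)]
targets = [1.0, 1.0, 0.0, 0.0]
sigma = 1.5  # fixed here for simplicity; an assumption of this sketch

weights = train_rbf(views, targets, sigma)
near_target = rbf_output((0.5, 0.0), views, weights, sigma)
near_other = rbf_output((4.5, 4.0), views, weights, sigma)
```

In this toy setting the module reproduces its target values at the stored views, stays high for an unseen point lying between the two target views, and stays near zero for a point lying between the views of the other object.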




Figure 2: A schematic illustration of the shape-space manifold defined by a Chorus of three active modules (lion, 
penguin, frog). Each of the three reference-shape modules is trained to ignore the viewpoint-related factors (the 
view space dimension, spanned by views that are shown explicitly for lion), and is thus made to respond to
shape-related differences between the stimulus (here, the giraffe) and its “preferred” shape. The actual dimensionality
of the space spanned by the outputs of the modules (Edelman and Intrator, 1997) can be lower than its nominal 
dimensionality (equal to the number of modules); here the space is shown as a two-dimensional manifold. 


3.2 Multi-classifier network design 

A multi-classifier network is constructed by combining several single-shape
modules, each tuned to a different shape class. The multi-classifier network is
trained according to the algorithm described in appendix C.1. The response
properties of such a network are illustrated in Figure 16, which shows the activity
of several RBF modules for a number of views of each of the objects on which they
had been trained. As expected, each module’s response is the strongest for views of
its preferred shape, and is weaker for views of the other shapes. Significantly,
the response is rarely very weak; this feature contributes to the distributed
nature of the representation formed by an ensemble of modules, by making several
modules active for most stimuli.¹

It has been hypothesized (Edelman et al., 1996) that the ensemble of responses
produced by a collection of object-specific modules can serve as a substrate for
carrying out classification of the stimulus at superordinate, basic, or subordinate
levels of categorization (Rosch et al., 1976; Rosch, 1978), depending on the manner
in which the response vector is processed. In the next section we describe a series
of computational experiments that examine the representational capabilities of a
multi-classifier network in a range of tasks.

¹ Note that much more information concerning the shape of the stimulus is contained
in the entire pattern of activities that it induces over the ensemble of the
reference-object modules, compared to the information in the identity of the
strongest-responding module (Edelman et al., 1992). Typical object recognition
systems in computer vision, which involve a Winner-Take-All decision, opt for the
latter, impoverished, representation of the stimulus.

4 Experimental results 

In all our computational experiments we used three-dimensional object geometry data
available as a part of a commercial database that contains several hundred shapes.
Ten reference objects were chosen at random from the database, to serve as the
prototypes for the multi-classifier network implementation of the Chorus scheme
(see Figure 5).

To focus on the problem of shape-based recognition, objects were rendered under the
Lambertian shading assumption, using a simulated point light source situated at the
camera, a uniform gray surface color, and no texture. Each object was presented to
the system separately, on a white background, at the center of a 256 × 256 window;
the maximal dimensions of the 3D bounding boxes of the objects were normalized to a
standard size (about one half of the size of the window). Thus, the problems of
figure-ground segmentation and of translation and scale invariance were effectively
excluded from consideration.

Figure 3: The Chorus scheme (section 3). The stimulus is first projected into a high-dimensional measurement
space, spanned by a bank of receptive fields. Second, it is represented by its similarities to reference shapes. In this
illustration, only three modules respond significantly, spanning a shape space that is nominally three-dimensional (in
the vicinity of the measurement-space locus of giraffe images). The inset shows the structure of each module. Each of
a small number of training views, $\mathbf{v}_i$, serves as the center of a Gaussian basis function
$G(\mathbf{a}, \mathbf{b}; \sigma) = \exp(-\|\mathbf{a} - \mathbf{b}\|^2 / \sigma^2)$;
the response of the module to an input vector $\mathbf{x}$ is computed as
$y = \sum_i w_i \, G(\mathbf{x}, \mathbf{v}_i; \sigma)$. The weights $w_i$ and the spread
parameter $\sigma$ are learned as described in (Poggio and Girosi, 1990). It is important to realize that the above approach,
which amounts to an interpolation of the view space of the training object using the radial basis function (RBF)
method, is not the only one applicable to the present problem. Other approaches, such as interpolation using the
multilayer perceptron architecture, may be advantageous, e.g., when the measurement space is “crowded,” as in face
discrimination (Edelman and Intrator, 1997).

The performance of the resulting 10-module Chorus system was assessed in three
different tasks: (1) identification of novel views of the ten objects on which the
system had been trained, (2) categorization of 43 novel objects belonging to
categories of which at least one exemplar was available in the training set, and
(3) discrimination among 20 novel objects, chosen at random from the database.


4.1 Identification of novel views of familiar objects

The ability of the system to generalize identification to novel views was tested on
the ten reference objects, for each of which we had trained a dedicated RBF module.
We experimented with three different identification algorithms, whose performance
was evaluated on a set of 169 views, taken around the canonical orientation
specific for each object (Palmer et al., 1981). The test views ranged over ±60° in
azimuth and elevation, at 10° increments.
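
The figure of 169 test views follows directly from the sampling described above: thirteen azimuth offsets times thirteen elevation offsets. A short Python check, with variable names chosen only for illustration, makes the arithmetic explicit.

```python
from itertools import product

STEP_DEG = 10   # angular increment stated in the text
MAX_DEG = 60    # maximal offset from the canonical orientation

# Offsets of -60, -50, ..., 0, ..., 50, 60 degrees: 13 values per axis.
offsets = range(-MAX_DEG, MAX_DEG + 1, STEP_DEG)

# Every (azimuth, elevation) pair around the canonical view.
test_views = list(product(offsets, offsets))
```

Here `len(test_views)` is 13 × 13 = 169, matching the number of views per object used throughout the experiments.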


4.1.1 Identification results 

We first computed the performance of each of the ten RBF modules using individually
determined thresholds. For each module, the threshold was set to the mean activity
on trained views² less one standard deviation. The performance of each of the ten
modules on its training object is summarized in Table 1. As one can see, the
residual error rates were about 10%, a figure that can probably be improved if a
more powerful architecture or a more extensive learning procedure is used. The
generalization error rate (defined as the mean of the miss and the false alarm
rates, taken over all ten reference objects) for the individual-threshold algorithm
was 7%.
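
The individual-threshold rule and the error measure defined above can be written out as a minimal Python sketch. The activity values are invented for illustration, and the choice of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption, since the text does not say which was used.

```python
import statistics

def module_threshold(train_activities):
    """Mean activity on the trained views less one standard deviation."""
    return statistics.mean(train_activities) - statistics.stdev(train_activities)

def error_rates(target_acts, other_acts, threshold):
    """Miss rate, false alarm rate, and their mean for one module."""
    misses = sum(a < threshold for a in target_acts)
    false_alarms = sum(a >= threshold for a in other_acts)
    miss_rate = misses / len(target_acts)
    fa_rate = false_alarms / len(other_acts)
    return miss_rate, fa_rate, (miss_rate + fa_rate) / 2

# Hypothetical module activities (not data from the paper).
train_activities = [0.9, 1.0, 1.1, 1.0]       # trained views of the target
target_test = [0.95, 1.05, 0.85, 1.0]         # novel views of the target
other_test = [0.2, 0.5, 0.95, 0.1, 0.3]       # views of other objects

threshold = module_threshold(train_activities)
miss_rate, fa_rate, error = error_rates(target_test, other_test, threshold)
```

With these made-up numbers the threshold is about 0.918, one of four target views is missed, and one of five non-target views raises a false alarm.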

We next considered the Winner-Take-All (WTA) algorithm, according to which the
outcome of the identification step is the label of the module that gives the
strongest response to the current stimulus (in Table 4, appendix D, entries for
modules that responded on the average the strongest are marked by bold typeface).
The error rate of the WTA method was 10%.

² Only about a tenth of the 169 views, determined by canonical vector quantization
(see appendix B.1), had been used in training the modules.








Figure 4: An image of a 3D object, overlaid by the outlines of the receptive fields (RFs) used to map object views
into a high-dimensional measurement space (see section 2.1 and appendix B). The system described here involved 
200 radially elongated Gaussian RFs; only some of them are drawn in this figure. 



                  cow1  cat   A1    gene  tuna  Lrov  Niss  F16   fly   TRex
miss rate         0.11  0.14  0.02  0.01  0.13  0.04  0.03  0.10  0.16  0.05
false alarm rate  0.08  0.11  0.07  0.02  0.11  0.05  0.04  0.12  0.12  0.03


Table 1: Individual shape-specific module performance. The table shows the miss and the false alarm rates of 
modules trained on the objects shown in Figure 5. The generalization error rate (defined as the mean of the miss 
and the false alarm rates) was 7%. 


Finally, we trained a second-level RBF module to map the 10-element vector of the
outputs of the reference-object modules into another 10-dimensional vector only one
of whose elements (corresponding to the actual identity of the input) was allowed
to assume a nonzero value of 1; the other elements were set to 0 (Edelman et al.,
1992). This approach takes advantage of the distributed representation of the
stimulus by postponing the Winner-Take-All decision until after the second-level
module has taken into account the similarities of the stimulus to all reference
objects. Indeed, the WTA algorithm applied to the second-level RBF output resulted
in an error rate of 6%.

4.1.2 Lessons from the identification experiments

The purpose of the first round of experiments was to ensure that the system of
reference-object modules could be trained to identify novel views of those objects.
The satisfactory performance of the RBF modules, which did generalize to novel
views of the training objects, allowed us to proceed to test the entire system in a
number of representation scenarios involving novel shapes, as described below. We
note that one cannot expect the performance on novel objects to be better than that
on the familiar ones. Thus, the figure obtained in the present section (about 10%
error rate) sets a bound on the performance in the other tasks. To improve that,
one may attempt to employ an alternative learning mechanism (as suggested above),
in conjunction with a better image transduction stage, instead of the 200 Gaussian
RFs we used here.

4.2 Categorization of novel object views 

Our second experiment tested the ability of the Chorus scheme to categorize
“moderately” novel stimuli, each of which belonged to one of the categories present
in the original training set of ten objects. To that end, we used the 43 test
objects shown in Figure 6. To visualize the utility of representation by similarity
to the training objects, we used multidimensional scaling (Shepard, 1980) to embed
the 10-dimensional layout of points corresponding to various views of the test
objects into a two-dimensional space (Figure 7). An examination of the resulting
plot revealed two satisfying properties. First, views of various objects clustered
by object identity (and not, for instance, by pose, as in patterns derived by
multidimensional scaling from distances measured in the original pixel space).
Second, in Figure 7 views of the QUADRUPEDS, the AIRPLANES and the CARS categories
all form distinct “super-clusters.”

To assess the quality of this representation numerically, we used it to support
object categorization. A number of categorization procedures were employed at this
stage. In every case, the performance of the 10-dimensional Chorus-based
representation was compared to that of the original multidimensional
receptive-field (RF) measurement space (see Figure 4; we shall return to discuss
this comparison later on).

Figure 5: The ten training objects used as reference shapes in the computational experiments described in the text,
organized by object categories. The objects were chosen at random from a collection available from Viewpoint
Datalabs, Inc. (http://www.viewpoint.com/).

The various categorization procedures we used were tested on the same set of 169
views per object as before. First, we assigned a category label to each of the ten
training objects (for instance, cow and cat were both labeled as QUADRUPEDS).
Second, we represented each test view as a 10-element vector of RBF-module
responses. Third, we employed a categorization procedure to determine the category
label of the test view. Each view that was attributed to an incorrect category by
the categorization procedure was counted as an error.

The category labels we used are the same as the labels given to the various groups
of objects in Figure 5. Note that a certain leeway exists in the assignment of the
labels. Normally, these are determined jointly by a number of factors, of which
shape similarity is but one. For example, a fish and a jet aircraft are likely to
be judged as different categories; nevertheless, if the shape alone is to serve as
the basis for the estimation of their similarity, these categories may coalesce. We
tested this assumption in an independent psychophysical experiment (Duvdevani-Bar,
1997), in which human subjects were required to judge similarity among the same
shapes used in the present study, on the basis of shape cues only. Similarity
scores³ obtained in this experiment revealed a clustering of object shapes in which
the fly belonged to the FIGURES category, and AIRCRAFT were interspersed within the
FISH category.

A careful examination of the confusion tables produced by the different
categorization methods we describe below revealed precisely these two phenomena as
the major sources of miscategorization errors. First, the fly classifier turned out
to be highly sensitive to the members of the FIGURES category. Second, the tuna
module was in general more responsive to AIRCRAFT than the F16 module (the sole
representative of AIRCRAFT among the reference objects). To quantify the effects of
this ambiguity in the definition of category labels on performance, we compared
three different sets of labels for the reference objects. The first set of category
labels is the one shown in Figure 5. The second set differs from the first one in
that it labels the fly as a FIGURE; in the third set, the tuna and the F16 have the
same category label.

³ Score data were gathered using the tree construction method (Fillenbaum and
Rapoport, 1979), and were submitted to multidimensional scaling analysis (SAS
procedure MDS, 1989) to establish a spatial representation of the different shapes.

4.2.1 Winner-Take-All (WTA) 

According to the WTA algorithm, the label of the 
module that produces the strongest response to the novel 
stimulus determines its category membership. We note 
that the WTA method is incompatible with the central 
tenet of the Chorus approach — that of distributed rep¬ 
resentation. To be informative, a representation based 
on similarities to reference objects requires that more 
than one module respond to any given stimulus. A sys¬ 
tem trained with this requirement in mind is expected 
to thwart the WTA method by having different modules 
compete for a given stimulus, especially when the latter 
does not quite fit into any of the familiar object cate¬ 
gories. Indeed, in this experiment the WTA algorithm 
yielded a high misclassification rate of 45% over the 43 
test objects for the first set of category labels. Adding 
a second-stage RBF module trained as described in sec¬ 
tion 4.1 reduced this figure to 30%. When the second 
and the third set of category labels were used, misclas¬ 
sification rate decreased to 32%, and 25%, respectively. 
Carrying out the WTA algorithm in the second-stage 
RBF space reduced both those figures to 23%. 
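
The WTA rule, and its sensitivity to the choice of label set, can be illustrated with a minimal Python sketch. The module names come from the text; the activation values, the INSECTS label, and the function name `wta_category` are hypothetical and serve only as an illustration.

```python
def wta_category(responses, labels):
    """Return the label of the module with the strongest response.

    responses: dict mapping module name -> activation for one stimulus
    labels:    dict mapping module name -> category label
    """
    winner = max(responses, key=responses.get)
    return labels[winner]


# Hypothetical activations of four reference modules for one novel stimulus.
responses = {"cow": 0.41, "tuna": 0.38, "F16": 0.35, "fly": 0.47}

# First label set (the fly's own category name is an assumed placeholder).
labels_first = {"cow": "QUADRUPEDS", "tuna": "FISH", "F16": "AIR", "fly": "INSECTS"}
# Second label set: identical, except that the fly is labeled as a FIGURE.
labels_second = dict(labels_first, fly="FIGURES")

print(wta_category(responses, labels_first))   # category under the first set
print(wta_category(responses, labels_second))  # category under the second set
```

Because only the winning module matters, relabeling a single reference object changes the category assigned to every stimulus that this module wins.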

4.2.2 k-NN using multiple views 

We next examined another simple categorization 
method, based on the k Nearest Neighbor (k-NN) principle 
(Duda and Hart, 1973). The categorization module 

Figure 6: The 43 novel objects used to test the categorization ability of the model (see section 4.2); objects are 
grouped by shape category. 

Figure 7: A 2D plot of the 10-dimensional shape space spanned by the outputs of the RBF modules; multidimensional 
scaling (MDS) was used to render the 10D space in 2D, while preserving as much as possible distances in the original 
space (Shepard, 1980). Each point corresponds to a test view of one of the objects; nine views of each of the ten 
training and five novel objects (buffalo, penguin, marlin, Isuzu, F15, marked by *’s). Note that views belonging 
to the same object tend to cluster (part of the residual spread of each cluster can be attributed to the constraint, 
imposed by MDS, of fitting the two dimensions of the viewpoint variation and the dimensions of the shape variation 
into the same 2D space of the plot). Note also that clusters corresponding to similar objects (e.g., the QUADRUPEDS) 
are near each other. The icons of the objects appear near the corresponding view clusters; those of five novel objects 
are drawn in cartouche. 

was made to store N views of each reference object, each 
represented as a point in the 10-dimensional space of 
module outputs (10 N views altogether were stored). The 
category of a test view was then determined by polling 
the k reference views that turned out to be the closest to 
the test view in the 10D space. The label of the majority 
of those k views was assigned to the test view. 
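
The polling procedure just described can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the experiments: the function name `knn_category`, the two-dimensional toy vectors (the model uses ten dimensions), and the tie-breaking rule are all assumptions.

```python
import math
from collections import Counter


def knn_category(test_view, stored_views, k):
    """Majority vote among the k stored views closest to test_view.

    stored_views: list of (vector, label) pairs; each vector holds the
                  module outputs for one stored view of a reference object.
    Ties are resolved in favor of the label met first among the nearest
    views (an arbitrary choice; the text does not specify one).
    """
    nearest = sorted(stored_views, key=lambda vl: math.dist(test_view, vl[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy stored views in a 2D stand-in for the 10D space of module outputs.
stored = [
    ((0.9, 0.1), "FISH"), ((0.8, 0.2), "FISH"),
    ((0.1, 0.9), "AIR"), ((0.2, 0.8), "AIR"), ((0.4, 0.6), "AIR"),
]
print(knn_category((0.7, 0.3), stored, k=3))
```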

The performance of this method for the third set of 
category labels is summarized in Figure 8, which shows 
the categorization rates for different values of k and N, 
averaged over the 43 test objects. Note that the misclas- 
sification error rate decreases with the number of views 
considered, possibly because the relative amount of re¬ 
liable information available in the neighborhood of the 
test view increases. In contrast, the tendency to err in¬ 
creases with k. The mean misclassification rate for this 
set of labels was 29% (41% and 31% for the first and sec¬ 
ond cases, respectively). In comparison, when the 200- 
dimensional measurement space was used to represent 
the individual views, the mean error rate was 37%, 34%, 
and 32% for the first, second and third sets of category 
labels, respectively. 

4.2.3 1-NN using centers of view clusters 

A variation on the above method is to use clusters 
of views of the reference objects, rather than individ¬ 
ual views. If the clusters are tight, their centroids ap¬ 
proximate them well. Accordingly, we used the cen¬ 
troid of the set of training views of each object (cast 
into the 10D space) as the representative member of 
that object’s cluster. Categorization followed the Near¬ 
est Neighbor principle, which, in line with the notation 
of the preceding section, may be called the 1-NN algo¬ 
rithm. This procedure resulted in misclassification rates 
of 20%, 17%, and 15% for the three different sets of cate¬ 
gory labels. The 1-NN procedure showed a clear benefit 
of the 10-dimensional RBF-module representation over 
the 200-dimensional measurement space, where the same 
procedure yielded misclassification rates of 30%, 25%, 
and 23%, for the three sets of category labels. 
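
A minimal Python sketch of this centroid-based variant follows. The names `centroid` and `nearest_centroid_category` and the two-dimensional toy vectors are assumptions made for illustration; the experiments themselves use the 10D module outputs or the 200D measurement vectors.

```python
import math


def centroid(views):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(views)
    return [sum(component) / n for component in zip(*views)]


def nearest_centroid_category(test_view, training_views, labels):
    """1-NN over per-object centroids.

    training_views: dict object name -> list of view vectors
    labels:         dict object name -> category label
    """
    centers = {name: centroid(views) for name, views in training_views.items()}
    closest = min(centers, key=lambda name: math.dist(test_view, centers[name]))
    return labels[closest]


# Toy training views for two reference objects (values are made up).
training_views = {
    "cow": [[1.0, 0.0], [0.8, 0.2]],
    "tuna": [[0.0, 1.0], [0.2, 0.8]],
}
labels = {"cow": "QUADRUPEDS", "tuna": "FISH"}
print(nearest_centroid_category([0.7, 0.4], training_views, labels))
```

Only one vector per reference object has to be stored and compared, which is the practical appeal of this variant when the view clusters are tight.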

4.2.4 k-NN to the training views 

The previous method assumed that clusters are well- 
represented by their means, which is not necessarily true 
in practice. Likewise, the assumption that an unlimited 
number of views of the training objects is available for 
use in the scheme of section 4.2.2 is not always justi¬ 
fied. The use of all and only those views that were ac¬ 
tually employed in the training of the 10 RBF modules 
circumvents both these problems. Thus, the last categorization 
method we tested involved the k-NN algorithm 
along with the training views specific to each of the RBF 
modules. At the first level of the RBF representation 
space, this method yielded mean misclassification rates 
of 23%, 16% and 14% for the three sets of category labels; 
the average is taken over values of k ranging from 1 
to 9. In the measurement space, the misclassification 
rates were slightly higher; on average over the same values 
of k, misclassification rates for the three category 
label sets were 23%, 22% and 20%. Tables 6 (in appendix D) 
and 2 give the detailed errors obtained for 
the third set of category labels, for k = 3. Note how 


the definition of category labels of the reference objects 
affects the resulting misclassification rate. 
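
The averaging protocol behind the figures quoted above (one misclassification rate per value of k, then a mean over k = 1, ..., 9) can be sketched as follows. The helper names and the toy data are hypothetical, and the tie-breaking rule inside `knn_label` is an assumption.

```python
import math
from collections import Counter


def knn_label(test_view, stored, k):
    """Majority label among the k stored views nearest to test_view."""
    nearest = sorted(stored, key=lambda vl: math.dist(test_view, vl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def mean_error_over_k(test_set, stored, ks=range(1, 10)):
    """Misclassification rate for each k in ks, averaged over those k."""
    rates = []
    for k in ks:
        errors = sum(knn_label(view, stored, k) != truth for view, truth in test_set)
        rates.append(errors / len(test_set))
    return sum(rates) / len(rates)


# Toy data: two well-separated groups of stored training views.
stored = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
          ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
test_set = [((0.2, 0.2), "A"), ((5.2, 5.2), "B")]
print(mean_error_over_k(test_set, stored))
```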

4.2.5 Lessons from the categorization 
experiments 

The pattern of the performance of the various algo¬ 
rithms we tested in the categorization tasks conforms to 
the expectations. Specifically, the RBF representation 
was better than the “raw” 200-dimensional measurement 
space. Although the latter outcome was not uniform (as 
apparent in the nearly identical performance of the RBF 
and the measurement spaces in some conditions), it was 
quite consistent under conditions that we consider more 
realistic (e.g., when view-cluster centers, or the actual 
training views were used in the representation; see sec¬ 
tions 4.2.3 and 4.2.4), and for the more appropriate def¬ 
initions of the categorization task (i.e., for the second 
and third sets of category labels). 

Despite those encouraging results, the performance 
of the system in the categorization experiments (about 
80%) falls short by 10-15% of the human performance in 
comparable circumstances. We list possible explanations 
of this shortcoming in the general discussion section. 

4.3 Discrimination among object views 

Our third experiment tested the ability of the Chorus 
scheme to represent 20 novel objects (shown in Figure 9), 
picked at random from the database, and to support 
their discrimination from one another. The tests in¬ 
volved the same arrangement of 169 views per object 
as before. The representation of the test objects is de¬ 
scribed in Table 5, which shows the activation of the ten 
reference-shape RBF modules produced by each of the 
test objects. 

4.3.1 Discrimination results 

It is instructive to consider the patterns of similar¬ 
ities revealed in this distributed 10-dimensional repre¬ 
sentation of the test objects. For instance, the giraffe 
turns out to be similar to the two quadrupeds present 
in the training set (cow and cat), as well as to the di¬ 
nosaur (TRex), for obvious reasons (it is also similar to 
the tuna and to the fly, for reasons which are less ob¬ 
vious, but immaterial: both these reference shapes are 
similar to most test objects, which makes their contribu¬ 
tion to the representation uninformative). Thus, in the 
spirit of Figure 2, the giraffe can be represented by 
the vector [1.87 1.93 1.72] of similarities to the three ref¬ 
erence objects which turn out to be informative in this 
discrimination context (cow, cat, TRex). 

As in Figure 7, the model clustered views by object 
identity, and grouped view clusters by similarity between 
the corresponding objects. In a quantitative estimate of 
this performance, we used the k-NN algorithm, as explained 
in section 4.2.2, with labels corresponding to object 
identity rather than to object category. The k-NN procedure 
that relied on proximities to the 169 views of each 
of the reference objects yielded a mean error rate (averaged 
over values of k ranging from 1 to 9) of 5% over 
the 169 test views of the 20 novel objects. When only 25 
views spanning the range of ±20° around the canonical 
orientation of each test object were considered, the mean 

Table 2: The individual errors for each category of test objects (see Table 6 for details). Note how the error rates 
decrease for the test objects of the FIGURES category in the second case, and for the test objects of the AIR category 
in the third case. 

Figure 8: The performance of the k-NN procedure described in section 4.2.2 for the third set of category labels, 
plotted vs. k and N. The plots show the misclassification rate for the 43 test objects shown in Figure 6. Left: 
errors using the measurement-space representation; the mean misclassification error is 32%. Right: the same, for the 
RBF-module representation space; the mean misclassification error is 29%. 


error rate dropped to 1.5%. This improvement may be 
attributed in part to the exclusion of non-representative 
views, e.g., the head-on view of the manatee, which is 
easily confused with the top view of the lamp. In the 
RF-representation case, the same experiment yielded an error 
rate of 1% with respect to the 169 views, whereas no 
error occurred when 25 views of all 20 objects were considered. 

When the same procedure was carried out for the 43 
test objects of Figure 6, the error rate was on average 
higher, because these objects resemble each other more 
closely. The mean error rate (averaged over values of 
k ranging from 1 to 9) for the 169 test views of the 43 
objects was 15% in the RBF space and 7% in the RF- 
representation space. 

4.3.2 Lessons from the discrimination 
experiments 

When objects are highly dissimilar from one another, 
discrimination (which requires that the objects be rep¬ 
resented with the least possible confusion) is relatively 
easy. In that case, the measurement space representation 
is effective enough. To see that, one may compare the 
discrimination results obtained with the measurement- 
space representation of the set of 20 highly distinct novel 
objects of Figure 9 to the results obtained with the same 
method on the measurement-space representation of the 


43 objects (Figure 6) used before. The advantage of the 
measurement-space representation over the RBF space 
in some discrimination tasks stems from the higher di¬ 
mensionality and hence higher informativeness of the for¬ 
mer. This high dimensionality is, however, a liability 
rather than an asset in generalization and other catego¬ 
rization tasks, an observation that is supported by our 
data. 

To quantify the ability of the model to reduce the di¬ 
mensionality of the measurement space, we estimated its 
performance with a varying number of reference objects, 
holding the size of the test set fixed. In addition, we 
quantified the extent of dimensionality reduction that 
could be afforded under the constraint of a specific pre¬ 
set discrimination error. Figure 10, left, shows the dis¬ 
crimination error rate obtained with the 3-NN method 
described in section 4.2.2 (using 25 views per test ob¬ 
ject), plotted against the number of reference and test 
objects (see also Table 7 in appendix D). Figure 10, 
right, shows the number of reference objects required to 
perform the discrimination task (using the 3-NN method 
on 25 views per test object) with an error rate less than 
10%, for a varying number of test objects. To the extent 
that it could be tested with the available data, the scal¬ 
ing of the model’s performance with the number of test 
objects seems to be satisfactory. 

Figure 9: The 20 novel objects, picked at random from the object database, which we used to test the representational 
abilities of the model (see section 4.3). 

Table 3: A summary of misclassification error rates exhibited by the various methods of section 4.2, for the three sets 
of category labels, using both the 200-dimensional measurement space and the 10-dimensional RBF representation 
space. The error rate improved with each categorization method we introduced. The Winner-Take-All (WTA) of 
section 4.2.1 produced the highest error, which was reduced when a second-stage RBF module was added. The 
k-NN method of section 4.2.2, using multiple views around the test view, produced similar error rates, which were 
significantly improved by using centers of view clusters (1-NN) (see section 4.2.3), or when the k-NN method involving 
the training views was used (section 4.2.4). For the last three methods, the error obtained in the RF measurement 
space was higher than the corresponding error obtained in the RBF space. Note that under all methods, the errors 
improved when the second and the third sets of category labels were used. 

Figure 10: Left: the mean discrimination error rate plotted against the representation dimensionality (the number of 
reference objects) and the size of the test set (the number of test objects). The means were computed over 10 random 
choices of reference and test objects. See Table 7 in appendix D for performance figures. Right: the dimensionality 
of the representation (the number of reference objects) required to perform discrimination with an error rate of 
10% or less, for a varying number of test objects. The data for this plot were obtained by repeating the task of 
discriminating among the views of N t test objects represented by the activities of N p reference objects 2500 times; 
this corresponded to 10 independent choices of N t test objects out of a set of 50 test objects (five values of N t were 
tested: 2,5,10,25,50), and to 10 random selections of N p = 1, 5,10,15, 20 out of the 20 available reference objects. 
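
The count of 2500 repetitions quoted in the caption of Figure 10 follows from the four factors it lists: five values of N t, five values of N p, and 10 independent draws for each. A short Python sketch (the variable names are ours) enumerates the combinations:

```python
from itertools import product

test_set_sizes = [2, 5, 10, 25, 50]       # values of N_t
reference_set_sizes = [1, 5, 10, 15, 20]  # values of N_p
test_draws = range(10)       # independent choices of the test objects
reference_draws = range(10)  # random selections of the reference objects

runs = list(product(test_set_sizes, reference_set_sizes, test_draws, reference_draws))
print(len(runs))  # 5 * 5 * 10 * 10 = 2500
```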


5 Discussion 

5.1 Implications for theories of visual 
representation 

In computer vision, one may discern three main theoret¬ 
ical approaches to object representation: pictorial rep¬ 
resentations, structural descriptions, and feature spaces 
(Ullman, 1989). According to the first approach, objects 
are represented by the same kind of geometric informa¬ 
tion one finds in a picture: coordinates of primitive ele¬ 
ments, which, in turn, may be as simple as intensity val¬ 
ues of pixels in an image (Lowe, 1987; Ullman, 1989; Pog- 
gio and Edelman, 1990; Ullman and Basri, 1991; Breuel, 
1992; Tomasi and Kanade, 1992; Vetter et al., 1997). 
Because of the effects of factors extrinsic to shape, this 
mode of representation can be used for recognition only 
if it is accompanied by a method for normalizing the ap¬ 
pearance of objects (Ullman, 1989) or, more generally, 
for separating the effects of pose from the effects of shape 
(Ullman and Basri, 1991; Tomasi and Kanade, 1992). 

It is not easy to adapt the pictorial approach to carry 
out categorization rather than recognition. One rea¬ 
son for that is the excessive amount of detail in pic¬ 
tures: much of the information in a snapshot of an ob¬ 
ject is unnecessary for categorization, as attested by the 
ability of human observers to classify line drawings of 
common shapes (Biederman and Ju, 1988; Price and 
Humphreys, 1989). Although a metric over images that 
would downplay within-category differences may be de¬ 
fined in some domains, such as classification of stylized 
“clip art” drawings (Ullman, 1996, p.173), attempts to 
classify pictorially represented 3D objects (vehicles) met 
with only a limited success (Shapira and Ullman, 1991). 

We believe that extension of alignment-like ap¬ 


proaches from recognition to categorization is problem¬ 
atic for a deeper reason than mere excess of information 
in images of objects. Note that both stages in the process 
of recognition by alignment (normalization and compar¬ 
ison; see Ullman, 1989) are geared towards pairing the 
stimulus with a single stored representation (which may 
be the average of several actual objects, as in Basri’s 
1996 algorithm). As we pointed out in the introduction, 
this strategy, designed to culminate in a winner-take- 
all decision, is inherently incompatible with the need to 
represent radically novel objects. 

The ability to deal with novel objects has been con¬ 
sidered so far the prerogative of structural approaches to 
representation (Marr and Nishihara, 1978; Biederman, 
1987). The structural approach employs a small num¬ 
ber of generic primitives (such as the thirty-odd geons 
postulated by Biederman), along with spatial relation¬ 
ships defined over sets of primitives, to represent a very 
large variety of shapes. The classification problem here is 
addressed by assigning objects that have the same struc¬ 
tural description to the same category. 

In principle, even completely novel shapes can be 
given a structural description, because the extraction of 
primitives from images and the determination of spatial 
relationships is supposed to proceed in a purely bottom- 
up, or image-driven fashion. In practice, however, both 
these steps proved so far impossible to automate. State 
of the art recognition systems in computer vision tend 
to ignore the challenge posed by the problems of catego¬ 
rization and of representation of novel objects (Murase 
and Nayar, 1995), or treat categorization as a kind of 
imprecise recognition (Basri, 1996). 

In contrast to all these approaches, the Chorus model 
is designed to treat both familiar and novel objects 
equivalently, as points in a shape space spanned by similarities 
to a handful of reference objects. According to 
Ullman’s (1989) taxonomy, this makes it an instance of 
the feature-based approach, the features being similari¬ 
ties to entire objects. The minimalistic implementation 
of Chorus described in the preceding sections achieved 
recognition and generalization performance comparable 
to that of the state of the art computer vision systems 
(Murase and Nayar, 1995; Mel, 1997; Schiele and Crow¬ 
ley, 1996), despite relying only on shape cues where other 
systems use shape and color or texture or both. Fur¬ 
thermore, this performance was achieved with a low¬ 
dimensional representation (ten nominal dimensions), 
whereas the other systems typically employ about a hun¬ 
dred dimensions; for a discussion of the importance of 
low dimensionality in this context, see (Edelman and 
Intrator, 1997). Finally, our model also exhibited sig¬ 
nificant capabilities for shape-based categorization and 
for useful representation of novel objects; it is reason¬ 
able to assume that its performance in these tasks can 
be improved, if more lessons from biological vision are 
incorporated into the system. 

5.2 Implications for understanding object 
representation in primate vision 

The architecture of Chorus reflects our belief that a good 
way to achieve progress in computer vision is to follow 
examples set by biological vision. Each of the building 
blocks of Chorus, as well as its general layout, can be 
readily interpreted in terms of well-established proper¬ 
ties of the functional architecture of the primate visual 
system (Poggio, 1990; Poggio and Hurlbert, 1994). The 
basic mechanism in the implementation of this scheme 
is a receptive field — probably the most ubiquitous 
functional abstraction of the physiologist’s tuned unit, 
widely used in theories of biological information process¬ 
ing (Edelman, 1997a). The receptive fields at the front 
end of Chorus are intended to parallel those found in 
the initial stages of the primate visual pathway. 4 Fur¬ 
thermore, an RBF module of the kind used in the sub¬ 
sequent stage of Chorus can be seen also as a receptive 
field, tuned both to a certain location in the visual field 
(defined by the extent of the front-end receptive fields) 
and to a certain location in the shape space (correspond¬ 
ing to the shape of the object on which the module has 
been trained). 

Functional counterparts both of individual compo¬ 
nents (basis functions) of RBF modules and of entire 
modules have been found in a recent electrophysiological 
study of the inferotemporal (IT) cortex in awake monkeys 
(Logothetis et al., 1995). The former correspond to 
cells tuned to particular views of objects familiar to the 
animal; the latter — to cells that respond nearly equally 
to a wide range of views of the same object. It is easy 
to imagine how an ensemble of cells of the latter kind, 

4 Admittedly, the 200 elongated-Gaussian RFs used in our 
present simulations are too crude to serve as a model even 
of the primary visual cortex. A better preprocessing stage 
(e.g., a simulation of the complex-cell system described in 
(Edelman et al., 1997)) should be tested in conjunction with 
the Chorus scheme. 


each tuned to a different reference object, can span an 
internal shape space, after the manner suggested above. 

While a direct test of this conjecture awaits experi¬ 
mental confirmation, indirect evidence suggests that a 
mechanism not unlike the Chorus of Prototypes is de¬ 
ployed in the IT cortex. This evidence is provided by the 
work of K. Tanaka and his collaborators, who studied ob¬ 
ject representation in the cortex of anaesthetized mon¬ 
keys (Tanaka, 1992; Tanaka, 1996). These studies re¬ 
vealed cells tuned to a variety of simple shapes, arranged 
so that units responding to similar shapes were clustered 
in columns running perpendicular to the cortical surface; 
the set of stimuli that proved effective depended to some 
extent on the monkey’s prior visual experience. If fur¬ 
ther experimentation reveals that a given object consis¬ 
tently activates a certain possibly disconnected subset of 
the columns, and if that pattern of activation smoothly 
changes in response to a continuous change in the shape 
or the orientation (Wang et al., 1996) of the stimulus, 
the principle of representation of similarity that serves 
as the basis of Chorus would be implicated also as the 
principle behind shape representation in the cortex. 

The results of several recent psychophysical studies 
of object representation in primates support the above 
conjecture. In each of a series of experiments, which 
involved subjective judgment of shape similarity and de¬ 
layed matching to sample, human subjects (Edelman, 
1995a; Cutzu and Edelman, 1996) and monkeys (Sug- 
ihara et al., 1997) have been confronted with several 
classes of computer-rendered 3D animal-like shapes, ar¬ 
ranged in a complex pattern in a common parameter 
space (cf. Shepard & Cermak, 1973). In each experi¬ 
ment, processing of the subject data by multidimensional 
scaling (used to embed points corresponding to the stim¬ 
uli into a 2D space for the purpose of visualization) in¬ 
variably revealed the low-dimensional parametric struc¬ 
ture of the set of stimuli. In other words, the proximal 
shape space internalized by the subjects formed a faith¬ 
ful replica of the distal shape space structure imposed on 
the stimuli. Furthermore, this recovery was reproduced 
by a Chorus-like model, trained on a subset of the stim¬ 
uli and subsequently exposed to the same test images 
shown to the subjects. As we argue elsewhere, these find¬ 
ings may help understand the general issue of cognitive 
representation, and, in particular, the manner in which 
representation can conform, or be faithful, to its object 
(Edelman and Duvdevani-Bar, 1997; Edelman, 1997b); 
their full integration will require a coordinated effort in 
the fields of behavioral physiology, psychophysics, and 
computational modeling. 

5.3 Summary 

We have described a computational model of shape- 
based recognition and categorization, which encodes 
stimuli by their similarities to a number of reference 
shapes, themselves represented by specially trained ded¬ 
icated modules. The performance of the model (see Ta¬ 
ble 3) suggests that this principle may allow for efficient 
representation, and, in most cases, correct categoriza¬ 
tion, of shapes never before encountered by the observer 
— a goal which we consider of greater importance than 



mere recognition of previously seen objects, and which so 
far has eluded the designers of computer vision systems. 

The most severe limitations of the present model are 
(1) the lack of tolerance to image-plane translation and 
scaling of the stimulus, (2) the lack of a principled way 
of dealing with occlusion and interference among neigh¬ 
boring objects in a scene, and (3) the lack of explicit rep¬ 
resentation of object structure (a shortcoming it shares 
with many other feature-based schemes). Whereas it 
may be possible to treat translation and scaling effec¬ 
tively without abandoning the present approach (Vetter 
et al., 1995; Riesenhuber and Poggio, 1998), its extension 
to scenes and to the explicit representation of structure 
must await future research. 

A Theoretical aspects of the design of a shape-tuned module 

In this section, we study the theoretical underpinnings of the ability of an RBF module to overcome the variability 
induced by pose changes. Specifically, we show that once an RBF module is trained on a collection of object views, 
its response to views that differ from its centers (the training examples) in a small displacement along the view space 
spanned by the examples is always higher than its response to views that are orthogonal to or directed away from 
this view space. 

A.1 The infinitesimal displacement case 

Assume the view space of a specific object shape can be sampled, and consider the sketch given in Figure 11, 
illustrating the following notation: 

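• $\mathbf{x}_1$ is a training view, and $\mathbf{x}_i$ another, arbitrary, training view, $i = 1, \ldots, k$. 

• $\Delta\mathbf{x}$ is a unit vector, $(\Delta\mathbf{x})^T \Delta\mathbf{x} = 1$. 

• $t > 0$ is a parameter controlling the extent of the displacement in the direction of $\Delta\mathbf{x}$. 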

Figure 11: An illustration of the basic notation used in the text; $\mathbf{x}_1$, $\mathbf{x}_i$ are training views of a specific object 
shape, $i = 1, \ldots, k$. $t\Delta\mathbf{x}$ is a vector representing a displacement from the view space spanned by the training 
vectors. The angle between $t\Delta\mathbf{x}$ and $\mathbf{x}_1 - \mathbf{x}_i$ indicates the direction of displacement. When all such angles are acute, 
the displacement is away from the view space, whereas when there is at least one such angle that is obtuse, the 
displacement is towards one of the $\mathbf{x}_i$'s, and therefore towards the view space. 

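Assume further that we train a (Gaussian) RBF network on a set of pairs $\{\mathbf{x}_i, y_i\}_{i=1}^{k}$, for $X = \{\mathbf{x}_i\}_{i=1}^{k}$, a set of views 
of that object, and a simple target $\mathbf{y} = \{y_i = 1\}_{i=1}^{k}$. For an input vector $\mathbf{x}$, the corresponding RBF($\mathbf{x}$) activity is 
given by: 

$$\mathrm{RBF}(\mathbf{x}) = \sum_{i=1}^{k} c_i\, G(\|\mathbf{x} - \mathbf{x}_i\|) = \sum_{i=1}^{k} c_i\, e^{-[(\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x} - \mathbf{x}_i)]^2/\sigma^2}. \tag{1}$$

Let $A = (\mathbf{a}_i)$, $B = (\mathbf{b}_j)$, and define $G(A; B)$ to be a matrix whose entry $(i, j)$ is the Gaussian $G(\|\mathbf{a}_i - \mathbf{b}_j\|)$. Training in 
its simplest form means solving the equation 

$$\mathbf{y} = G(X; X) \cdot \mathbf{c},$$

for the value of $\mathbf{c}$. The solution is: 

$$\mathbf{c} = G^{+}(X; X) \cdot \mathbf{y}, \tag{2}$$

where $^{+}$ denotes the (pseudo) inverse of $G$. 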
Thus, equation (1) takes the form 

$$\mathrm{RBF}(\mathbf{x}) = G(\mathbf{x}; X) \cdot G^{+}(X; X) \cdot \mathbf{y}. \tag{3}$$
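
As an idealized illustration of why a trained module responds with a value close to 1 at its own centers, assume (this is our simplification, not a statement from the text) that $G(X; X)$ is exactly invertible, so that $G^{+} = G^{-1}$. The row vector $G(\mathbf{x}_j; X)$ is then the $j$-th row of $G(X; X)$, and multiplying it by $G^{-1}(X; X)$ gives the $j$-th unit row vector $\mathbf{e}_j^T$:

```latex
\[
\mathrm{RBF}(\mathbf{x}_j)
  = G(\mathbf{x}_j; X)\, G^{-1}(X; X)\, \mathbf{y}
  = \mathbf{e}_j^{T}\, \mathbf{y}
  = y_j
  = 1 .
\]
```

With a pseudo-inverse in place of an exact inverse, the response at a training view need not equal 1 exactly.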

Upon successful training, $\mathrm{RBF}(\mathbf{x}_i) = 1 - \epsilon$, $\epsilon \ll 1$. We now compute the change in RBF behavior resulting from 
an infinitesimal displacement from a training vector $\mathbf{x}_1$, in an arbitrary direction. 


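$$\left.\frac{d\,\mathrm{RBF}(\mathbf{x} + t\Delta\mathbf{x})}{dt}\right|_{\mathbf{x}=\mathbf{x}_1,\ t>0} = \frac{d}{dt}\sum_{i=1}^{k} c_i\, e^{-[(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)]^2/\sigma^2}$$

$$= \sum_{i=1}^{k} c_i\, e^{-[(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)]^2/\sigma^2} \cdot \frac{d}{dt}\left\{-[(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)]^2/\sigma^2\right\}.$$

Denote 

$$D = \frac{d}{dt}\left\{-[(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)]^2/\sigma^2\right\}.$$

By the chain rule, 

$$D = -\frac{2}{\sigma^2}(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i) \cdot \frac{d}{dt}\left[(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)\right].$$

Since $\Delta\mathbf{x}$ is a unit vector, and by the commutativity of the inner product, we consequently have, 

$$(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i) = (\mathbf{x}_1 - \mathbf{x}_i)^T(\mathbf{x}_1 - \mathbf{x}_i) + 2t(\Delta\mathbf{x})^T(\mathbf{x}_1 - \mathbf{x}_i) + t^2,$$

and, 

$$\frac{d}{dt}\left[(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x}_1 + t\Delta\mathbf{x} - \mathbf{x}_i)\right] = 2(\Delta\mathbf{x})^T(\mathbf{x}_1 - \mathbf{x}_i) + 2t.$$

Thus, dropping the $t^2$ term of the first factor, which does not affect the limit $t \to 0$ taken below, 

$$D = -\frac{2}{\sigma^2}\left[\|\mathbf{x}_1 - \mathbf{x}_i\|^2 + 2t(\Delta\mathbf{x})^T(\mathbf{x}_1 - \mathbf{x}_i)\right]\left[2(\Delta\mathbf{x})^T(\mathbf{x}_1 - \mathbf{x}_i) + 2t\right]. \tag{5}$$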


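Consider the following two possible cases: 

(A) $\forall i\ \ (\Delta\mathbf{x})^T(\mathbf{x}_1 - \mathbf{x}_i) > 0$, 

(B) $\exists i\ \ (\Delta\mathbf{x})^T(\mathbf{x}_1 - \mathbf{x}_i) < 0$. 

Note that case (B) means that the direction of change, determined by the vector $\Delta\mathbf{x}$, is along the view space spanned 
by the $\mathbf{x}_i$, $i = 1, \ldots, k$, whereas in case (A), the direction of the displacement is orthogonal to or away from the view 
space (see, again, Figure 11). Denote $d_i = \|\mathbf{x}_1 - \mathbf{x}_i\|^2$, $\Delta_i = (\Delta\mathbf{x})^T(\mathbf{x}_1 - \mathbf{x}_i)$, and note that $d_i > 0$. With the new 
notation, equation (5) becomes, 

$$D = -\frac{2}{\sigma^2}(d_i + 2t\Delta_i)(2\Delta_i + 2t) = -\frac{4}{\sigma^2}\left(d_i\Delta_i + d_i t + 2t\Delta_i^2 + 2t^2\Delta_i\right),$$

and when $t$ goes to zero, this yields 

$$D \xrightarrow[t \to 0]{} -\frac{4}{\sigma^2}\, d_i \Delta_i.$$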


Consequently, in the limit for $t \to 0$, from equation (5) we have 

$$\left.\frac{d\,\mathrm{RBF}(\mathbf{x} + t\Delta\mathbf{x})}{dt}\right|_{\mathbf{x}=\mathbf{x}_1,\ t>0} \xrightarrow[t \to 0]{} \sum_{i=1}^{k} c_i\, e^{-d_i^2/\sigma^2}\left(-\frac{4}{\sigma^2}\, d_i \Delta_i\right). \tag{6}$$

Denote this limit by $L$, $L = -\frac{4}{\sigma^2}\sum_{i=1}^{k} c_i\, e^{-d_i^2/\sigma^2}\, d_i \Delta_i$. 

For case (A), $\Delta_i > 0$, $\forall i$; therefore, if all $c_i > 0$, we would have $L_A < 0$, $L_A < L_B$, for $L_A$, $L_B$ the values of 
the limit $L$, for cases (A) and (B), respectively. This means that an infinitesimal displacement along the view space 
results in a smaller change of the corresponding RBF activity than the RBF change resulting from a displacement 
that is orthogonal to, or away from, the view space. This establishes the desired property of an RBF-based classifier 
(an approximately constant behavior for different views of the target shape, with the response falling off for views 
of different shapes) for the infinitesimal view change case. 
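
The sign argument above can be checked numerically on a toy configuration. The sketch below assumes that the limit in equation (6) has the form $L = -(4/\sigma^2)\sum_i c_i\, e^{-d_i^2/\sigma^2}\, d_i \Delta_i$, with $d_i$ the squared distance between $\mathbf{x}_1$ and $\mathbf{x}_i$ and $\Delta_i = (\Delta\mathbf{x})^T(\mathbf{x}_1 - \mathbf{x}_i)$; the function name `limit_L`, the collinear toy views, and the unit coefficients are illustrative assumptions.

```python
import math


def limit_L(c, x1, views, dx, sigma):
    """Evaluate L = -(4/sigma^2) * sum_i c_i * exp(-d_i^2/sigma^2) * d_i * Delta_i.

    d_i     = squared distance between x1 and views[i]
    Delta_i = inner product of the unit vector dx with (x1 - views[i])
    """
    total = 0.0
    for c_i, x_i in zip(c, views):
        diff = [a - b for a, b in zip(x1, x_i)]
        d_i = sum(v * v for v in diff)
        delta_i = sum(u * v for u, v in zip(dx, diff))
        total += c_i * math.exp(-d_i ** 2 / sigma ** 2) * d_i * delta_i
    return -4.0 / sigma ** 2 * total


views = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # toy "view space": points on a line
x1 = views[0]
c = [1.0, 1.0, 1.0]  # positive coefficients, as asserted by Claim A.1

# Case (A): Delta_i > 0 for every other view (displacement away from them).
L_away = limit_L(c, x1, views, (-1.0, 0.0), sigma=2.0)
# Case (B): Delta_i < 0 (displacement toward the other views).
L_toward = limit_L(c, x1, views, (1.0, 0.0), sigma=2.0)
print(L_away, L_toward)
```

In this configuration the response decreases when moving away from the other views and increases when moving toward them, in line with $L_A < 0$ and $L_A < L_B$.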

Claim A.l c* > 0, Vz = 1,..., k. 


Proof: 

From equation (2) we have $c_i = \sum_{j} (G^{+})_{ij}\, y_j$, the sum of elements in the $i$-th row of the matrix $G^{+}$, where $y_j$ are
the targets, $y_j = 1$, $j = 1, \dots, k$, and $G^{+}$ is the (pseudo) inverse of $G$ whose elements are $G_{ij} = e^{-d_{ij}^2/\sigma^2}$,
for $d_{ij} = \|x_i - x_j\|$.

Note that $G = I + A$, where $I$ is a unit matrix${}^{5}$, and $A$ is a matrix whose elements are $\ll 1$, under a proper bound
on $\sigma$ (see below). Thus, by Taylor expansion for the matrix $G$, we have,

$$G^{+} = (I + A)^{-1} \approx I - A + O(A^2).$$

To complete the proof, let $\sigma < (\ln k)^{-1/2} \min_{i<j} d_{ij}$, for $k$ the number of training vectors. Thus, for all $i \neq j$,
$d_{ij} > \sigma (\ln k)^{1/2}$, $d_{ij}^2 > \sigma^2 \ln k$, and $-d_{ij}^2/\sigma^2 < -\ln k = \ln \frac{1}{k}$. Taking the exponent of both terms, we obtain

$$e^{-d_{ij}^2/\sigma^2} < e^{\ln \frac{1}{k}} = \frac{1}{k}.$$

As a result, the sum of elements in any row of $G^{+}$ consists of 1 (the element on the diagonal, contributed by the
unit matrix) minus $k - 1$ elements, each smaller than $\frac{1}{k}$. Thus, we finally have,

$$c_i \approx 1 - \sum_{j \neq i} e^{-d_{ij}^2/\sigma^2} > 1 - \sum_{j=1}^{k-1} \frac{1}{k} = 1 - \frac{k-1}{k} > 0, \qquad \forall i = 1, \dots, k.$$
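The claim can be checked numerically. The following is a minimal sketch, assuming randomly drawn training vectors and NumPy's pseudo-inverse; the function name `rbf_output_weights`, the dimensions, and the factor 0.99 used to stay strictly inside the bound on $\sigma$ are illustrative choices, not part of the original derivation.

```python
import numpy as np

def rbf_output_weights(centers, sigma):
    """Return the weights c = G^+ y (with all targets y_j = 1) and the matrix G."""
    diffs = centers[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)        # d_ij^2
    G = np.exp(-sq_dists / sigma ** 2)            # G_ij = exp(-d_ij^2 / sigma^2)
    targets = np.ones(len(centers))               # y_j = 1
    return np.linalg.pinv(G) @ targets, G

rng = np.random.default_rng(0)
k = 8                                             # number of training vectors
centers = rng.normal(size=(k, 5))                 # hypothetical training views

# Smallest pairwise distance between distinct centers.
diffs = centers[:, None, :] - centers[None, :, :]
dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
off_diagonal = ~np.eye(k, dtype=bool)
min_dist = dists[off_diagonal].min()

# Width chosen just inside the bound sigma < min d_ij / sqrt(ln k).
sigma = 0.99 * min_dist / np.sqrt(np.log(k))

weights, G = rbf_output_weights(centers, sigma)
off_diag_max = G[off_diagonal].max()
print("largest off-diagonal element below 1/k:", off_diag_max < 1.0 / k)
print("all weights positive:", bool(np.all(weights > 0)))
```

With the width inside the bound, every off-diagonal element of $G$ falls below $1/k$, which is the condition the proof relies on.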


A.2 The finite displacement case 

We next extend the above proof to a finite view displacement. As before, we consider a change in object appearance 
due to (a) the extrinsic effect of pose, i.e. a change along view space direction (object rotation), and (b) an intrinsic 
shape change, that is, a change orthogonal to, or away from the view space (shape deformation). 

First, note that the two factors determining the two-dimensional appearance of an object, the shape and pose, 
are orthogonal. To demonstrate this, we have simulated shape and pose variation for three-dimensional objects 
consisting of a collection of points in 3D. For such a point-cloud object, shape deformation is simulated by a random 
displacement of the cloud’s points, whereas a change of pose simply means an arbitrary rotation of all points. The 
two-dimensional appearance of the deformed, or rotated object is obtained by an orthographic projection, and the 
displacement from the two-dimensional appearance of the original cloud is measured. The inner product between 
the two vectors, representing the changes in appearance caused by rotation and deformation, is calculated to find the 
cosine of the angle between the shape and pose displacements. Figure 12 shows the above calculation for different 
combinations of shape and pose variations, averaged over many independent runs. Indeed, for a significant range of 
variation, orthogonality is observed between the shape and pose factors that determine the appearance of an object. 
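A simulation of this kind might look as follows. This is a sketch under stated assumptions, not the code behind Figure 12: the cloud is drawn from a standard normal distribution, the deformation is isotropic Gaussian noise, the rotation axis is random, and the names `rotation_matrix`, `project`, and `mean_cosine` as well as the default parameter values are hypothetical.

```python
import numpy as np

def rotation_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` (radians) about `axis`."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def project(points):
    """Orthographic projection: drop depth, flatten into one appearance vector."""
    return points[:, :2].ravel()

def mean_cosine(n_runs=2000, n_points=10, angle=0.3, deform=0.1, seed=0):
    """Average cosine between pose-induced and shape-induced displacements."""
    rng = np.random.default_rng(seed)
    cosines = []
    for _ in range(n_runs):
        cloud = rng.normal(size=(n_points, 3))
        R = rotation_matrix(rng.normal(size=3), angle)
        rotated = cloud @ R.T                                  # change of pose
        deformed = cloud + deform * rng.normal(size=cloud.shape)  # change of shape
        d_pose = project(rotated) - project(cloud)
        d_shape = project(deformed) - project(cloud)
        cosines.append(d_pose @ d_shape /
                       (np.linalg.norm(d_pose) * np.linalg.norm(d_shape)))
    return float(np.mean(cosines))

avg = mean_cosine()
print("mean cosine between pose and shape displacements:", avg)
```

Under these assumptions the mean cosine stays close to zero, in line with the orthogonality reported above.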
Now, let $x_1$ be, as before, an arbitrary training view of the object, and let $\Delta v$, $\Delta p$ be finite displacements along,
and perpendicular to, the view space, respectively.

Note that because Gaussians are factorizable, and because the view-space and the shape-space projections of an
object appearance are orthogonal to each other, we have

$$G(\|x - t\|) = e^{-\|x - t\|^2/\sigma^2} = e^{-\|x^p - t^p\|^2/\sigma^2} \cdot e^{-\|x^v - t^v\|^2/\sigma^2}. \qquad (7)$$

Consider now a displacement within an object view space. This change in the object’s (two-dimensional) appearance 
results from a (three-dimensional) rotation of the object away from some reference view. The upper bound on this 
kind of change is therefore finite. To see that, recall that both $\{x_i\}_{i=1}^{k}$ and $x$ are different two-dimensional views of
the same object, resulting from projection of the corresponding three-dimensional “views,” $X_i$, $i = 1, \dots, k$, and $X$,

5 $\forall i$, $d_{ii} = 0$; thus, $e^{-d_{ii}^2/\sigma^2} = 1$ are the diagonal elements.




Figure 12: Orthogonality of shape and pose. The displacement in the two-dimensional appearance of a three-dimensional
10-point cloud object due to variations in pose and shape is measured, assuming orthographic projection.
The plot shows the average value of the cosine of the angle between the shape and pose displacements, calculated for
20,000 randomly chosen values of pose variation (an arbitrary rotation of the cloud’s points), and shape deformation
(a random displacement of the cloud’s points). Data were gathered into a small number of bins, sorted by the angle 
of rotation (shown in radians along the pose axis), and by the amount of shape deformation, measured as the fraction 
of the random displacement with respect to the total cloud distribution (shape axis). 


respectively. That is, $x = \mathcal{P}X$, $x_i = \mathcal{P}X_i$, where $\mathcal{P}$ is a 3D $\to$ 2D projection. Any three-dimensional view can be
described by an object rotation $R_{\hat{n}}(\omega)$ away from some orientation, say $X_c$, in the three-dimensional space.

Thus,

$$\|x - x_i\| = \|\mathcal{P}X - \mathcal{P}X_i\| = \|\mathcal{P}R_{\hat{n}}(\omega) X_c - \mathcal{P}R_{\hat{n}_i}(\omega_i) X_c\|.$$

Under orthographic projection, the difference between projected vectors is the projection of their difference, and the
norm, which can only be reduced by projection, is preserved by the rotation mapping (Kanatani, 1990). Thus,

$$\|\mathcal{P}[R_{\hat{n}}(\omega) X_c - R_{\hat{n}_i}(\omega_i) X_c]\| \le \|R_{\hat{n}}(\omega) X_c - R_{\hat{n}_i}(\omega_i) X_c\| \le \|R_{\hat{n}}(\omega) X_c\| + \|R_{\hat{n}_i}(\omega_i) X_c\| = \|X_c\| + \|X_c\| = 2\|X_c\|.$$

Thus, an upper bound on the extent of the view space displacement is easily established. We denote this bound by
$D$. Let $x = x_1 + \Delta v$. From the above, $\|\Delta v\| \le D$. By triangle inequality,

$$\|x - x_i\| = \|(x_1 + \Delta v) - x_i\| \le \|(x_1 + \Delta v) - x_1\| + \|x_1 - x_i\| = \|\Delta v\| + \|x_1 - x_i\|.$$

Hence,

$$-\|x - x_i\|^2 \ge -\left[\|\Delta v\|^2 + 2\|\Delta v\| \cdot \|x_1 - x_i\| + \|x_1 - x_i\|^2\right].$$

As a consequence, because all $c_i$ are positive (Claim A.1),
Now, let $\sigma < 2 \min_{i<j} d_{ij}$, for $d_{ij} = \|x_i - x_j\|$.

Thus, $\|x_1 - x_i\| >$

Because $\|\Delta v\| \le D$, and $-2\|\Delta v\| \ge -2D$, we have


—21| Au|| 11 xi — x, 


c r 
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> - 
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Finally, 
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a e ^ Xl x dl ' a • e o- 2 • e o- 
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for 


and. 


.-£(i + S) 


RBF(x) > e~F\ L ^ ~> . .RBF(xi) 


£> < a, F = — < 1 
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e -F(l + F) > q . 


Now, for a finite displacement perpendicular to the view space, $x = x_1 + \Delta p$, we have by orthogonality (equation (7)),

$$RBF(x) = \sum_{i=1}^{k} c_i\, e^{-\|(x_1 + \Delta p) - x_i\|^2/\sigma^2} = \sum_{i=1}^{k} c_i\, e^{-\|x_1 - x_i\|^2/\sigma^2} \cdot e^{-\|\Delta p\|^2/\sigma^2} = RBF(x_1) \cdot e^{-\|\Delta p\|^2/\sigma^2}.$$

For an arbitrary amount of shape-space displacement, say, $\|\Delta p\| \gg 0$, $e^{-\|\Delta p\|^2/\sigma^2} \ll 1$ can become arbitrarily small,
since $-\|\Delta p\|^2 \ll 0 \Rightarrow e^{-\|\Delta p\|^2/\sigma^2} \ll 1$.

Hence we finally have, for a shape displacement,

$$RBF(x) \le e^{-\|\Delta p\|^2/\sigma^2}\, RBF(x_1) < RBF(x_1).$$

From the above arguments, we may conclude that (1) any displacement within the view space of the target object
results in an RBF activity that cannot be less than some positive, not too small, fraction of its activity on the training
examples, whereas (2) for a displacement perpendicular to the view space, the corresponding RBF activity is always
below the activity obtained in training, with the activity decreasing for increasing shape differences.
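The second conclusion rests on the factorization of the Gaussian for a displacement orthogonal to the view space. The sketch below illustrates it numerically under simplifying assumptions that are not part of the original argument: the view space is taken to be the span of the first two coordinates of a four-dimensional measurement space, and the centers, weights, and width are arbitrary illustrative values.

```python
import numpy as np

def rbf(x, centers, weights, sigma):
    """RBF(x) = sum_i c_i * exp(-||x - x_i||^2 / sigma^2)."""
    sq_dists = np.sum((centers - x) ** 2, axis=1)
    return float(weights @ np.exp(-sq_dists / sigma ** 2))

rng = np.random.default_rng(1)
k, sigma = 5, 1.5

# Training views confined to the first two coordinates (the assumed view space).
centers = np.zeros((k, 4))
centers[:, :2] = rng.normal(size=(k, 2))
weights = rng.uniform(0.5, 1.0, size=k)        # positive weights, as in Claim A.1

x1 = centers[0]                                # a training view
delta_p = np.array([0.0, 0.0, 0.7, -0.4])      # displacement orthogonal to the view space

activity_displaced = rbf(x1 + delta_p, centers, weights, sigma)
activity_training = rbf(x1, centers, weights, sigma)
predicted = activity_training * np.exp(-(delta_p @ delta_p) / sigma ** 2)

print(activity_displaced, predicted, activity_training)
```

Because the displacement is orthogonal to every difference $x_1 - x_i$, the activity at the displaced point equals the training activity scaled by $e^{-\|\Delta p\|^2/\sigma^2}$, and therefore lies below it.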


B Training individual shape-specific modules 

To train an RBF module one needs to place the basis functions optimally, so as to cover the input space (i.e., determine
the basis-function centers), calculate the output weights associated with each center, and tune the basis-function
width.

B.l Finding the optimal placement for each basis function 

Whereas the computation of the weight assigned to each basis function is a linear optimization problem, finding 
the optimal placement for each basis in the input space is much more difficult (Poggio and Girosi, 1990). Here, we 
consider a simplified version of this problem, which assumes that a small optimal subset of examples to be used in 
training is chosen out of a larger set of available data, consisting of views of the shape on which the module is trained. 
Views are given by their measurement-space representations (here, we used a small collection of filters with radially 
elongated Gaussian receptive fields, randomly positioned over the image (Weiss and Edelman, 1995); see Figure 4). 
This approach leads naturally to the question of the definition of optimality. Defining an optimal subset of views as 
the subset that minimizes the nearest-neighbor classification error amounts to performing vector quantization (VQ; 
see appendix C) in the input space (Moody and Darken, 1989; Poggio and Girosi, 1989). 

By definition, quantizing an input space results in a set of vectors that are the best representation of the entire 
space. A quantization is said to be optimal if it minimizes an expected distortion. Simple measures of the latter, 
such as squared Euclidean distance, while widely used in vector quantization applications (Gersho and Gray, 1992), 
do not correlate well with the subjective notion of distance appropriate for the task of quantizing an object view 
space. Specifically, Euclidean distances in a pixel space do not reflect object identities if the illumination conditions
are allowed to vary (Adini et al., 1997). Likewise, in a Euclidean receptive-field (RF) space, images of similar objects
tend to cluster together by view, not by object shape, if objects may rotate (Duvdevani-Bar and Edelman, 1995;
Lando and Edelman, 1995). This implies that Euclidean distance between RF representations of object views cannot
overcome the variability in object appearance caused by changes in viewing conditions, and that a different measure
of quantization distortion is needed.

Figure 13: A set of 49 views of one of the figure-like test objects (A1), taken at grid points along an imaginary viewing
sphere centered around the object. Views differ in the azimuth and the elevation of the camera, both ranging between
−60° and 60° at 20° increments. We used the Canonical Vector Quantization (CVQ) procedure to select the most
representative views for the purpose of training the object representation system (section B.1); the selected views of
A1 are marked by frames.

The measure we incorporated in the present model is canonical distortion, proposed by Baxter (1996). The notion of 
canonical distortion is based on the observation that in any given classification task, there exists a natural environment 
of functions, or classifiers, that allow for a faithful representation of distance in the input space. The property shared 
by all such classifiers is that their output varies little across instances of the same entity (class); ideally, the output of 
a particular classifier is close to one if the input is an instance of its target class, and is close to zero otherwise. Thus, 
in the space of classifier outputs instances of the same class are closer together, and instances of different classes 
farther apart, than in the input space. According to Baxter, the distortion measurement induced by the classifier 
space is the desired canonical distortion measure. 6 

Following Baxter’s ideas, we sample the view space of an object at a fixed grid wrapped around the viewing sphere 
centered at the object (see Figure 13), then canonically quantize the resulting set of object views. The representative 
views, which are subsequently used to train the object-specific modules, are chosen in accordance with the following 
three criteria. First, a classifier (i.e., module) output should be approximately constant for different views of its 
selected object. Second, views of the same object should be tightly clustered in the classifier output space. Third, 
clusters corresponding to views of different objects should be separated as widely as possible. 

We have combined these three criteria in a modified version of the Generalized Lloyd algorithm (GLA) for vector
quantization (Linde et al., 1980), known also as the k-means method (MacQueen, 1967). In contrast to the conventional
GLA, which carries out quantization in the input vector space, our algorithm concentrates on the classifier
output space. Training an RBF network on the centers of clusters resulting from the optimal partition of the classifier 
output space addresses the first of the three requirements — an approximately constant output across views of an 
object. The other two requirements are addressed by a simultaneous minimization of the ratio of between-objects to 
within-object view scatter (a cluster compactness criterion; see Duda and Hart, 1973). 

6 Formally, for an environment of functions $f \in \mathcal{F}$, mapping a probability space $(X, P, \sigma_X)$ into a space $(Y, \sigma)$, with
$\sigma : Y \times Y \to \mathbb{R}$, a natural distortion measure on $X$ induced by the environment is $\rho(x, y) = \int_{\mathcal{F}} \sigma(f(x), f(y))\, dQ(f)$, for
$x, y \in X$, and $Q$ an environmental probability measure on $\mathcal{F}$.





















Increasing the number of examples on which a classifier is trained always improves both the RBF-module classifier 
performance and the view-space compactness criterion (see Figure 14). Our version of Baxter’s Canonical Vector 
Quantization (CVQ) relies on this observation by taking the so-called “greedy” algorithmic approach. The algorithm 
is initialized with an empty set of views and adds new views iteratively. At each iteration, the new view is chosen 
so as to minimize the compactness criterion, and the entire process follows the gradient of improvement in classifier
performance (see appendix C.1 for details).
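The greedy strategy described above can be sketched as a generic forward-selection loop. In the sketch below the compactness criterion is passed in as a callable; the function name `greedy_select`, the tolerance-based stopping rule, and the toy coverage criterion used in the example are assumptions made for illustration and do not reproduce the criterion used for the modules.

```python
from typing import Callable, List, Sequence

def greedy_select(n_views: int,
                  criterion: Callable[[Sequence[int]], float],
                  max_size: int,
                  tol: float = 1e-3) -> List[int]:
    """Start from an empty set and repeatedly add the view that minimizes
    `criterion`; stop when the improvement falls below `tol`."""
    selected: List[int] = []
    best_score = float("inf")
    while len(selected) < max_size:
        candidates = [v for v in range(n_views) if v not in selected]
        if not candidates:
            break
        scores = {v: criterion(selected + [v]) for v in candidates}
        best_view = min(scores, key=scores.get)
        if best_score - scores[best_view] < tol:
            break                      # no worthwhile improvement left
        selected.append(best_view)
        best_score = scores[best_view]
    return selected

# Toy stand-in criterion: mean distance from each of nine equally spaced
# "views" to the nearest selected view (lower is better).
positions = list(range(9))

def coverage(selected: Sequence[int]) -> float:
    return sum(min(abs(p - positions[s]) for s in selected)
               for p in positions) / len(positions)

chosen = greedy_select(len(positions), coverage, max_size=3)
print(chosen)
```

In the toy example the first view chosen is the central one, and the next two are placed so as to cover the two halves of the range.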




Figure 14: The effect of training-set size on the performance of an RBF module trained under the compactness 
criterion. Left: the recognition error obtained for the Nissan module, trained as a part of a network consisting of ten 
object modules (see Figure 5 below). For each object, training involved a set of N = 49 views, taken as described 
in Figure 13. The abscissa is the number t of the training vectors (examples). For t < 15 or so, the performance of 
the module trained on the CVQ-derived code vectors (dashed line) is better than the error obtained with the same 
number of randomly chosen training vectors (solid line). When t is large, the resulting error is low in any case. Right: 
The compactness criterion (the ordinate), defined as the ratio of between-cluster to within-cluster scatter (Duda and 
Hart, 1973), plotted against the size of the training set. Note that the values of the compactness criterion obtained 
for the CVQ code vectors (dashed line) are significantly better (lower) than the values obtained for a module trained 
on the same number of randomly chosen vectors (solid line). In both plots, the error bars represent the standard 
error of the mean, calculated over 25 independent random choices of the training vectors. 


B.2 Tuning the basis-function width 

A complete specification of an RBF module consists of the choice of basis function centers, the output weights 
associated with each center, and the spread constant, or the width, of the basis functions. The width parameter has 
a direct influence on the performance of an RBF classifier (i.e., its ability to accept instances of the class on which it 
is trained and to reject other input). Optimally, the width parameter should be set to a value that yields equal miss 
and false-alarm error rates (see Figure 15). Following the rule of thumb according to which the width parameter 
should be much larger than the minimum distance and much smaller than the maximum distance among the basis 
centers, we employ a straightforward binary search to optimize its value. 
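A binary search of this kind can be sketched as follows, assuming that the miss rate decreases and the false-alarm rate increases monotonically with the width, so that the two curves cross once inside the search interval. The function `tune_width`, its signature, and the synthetic error curves in `toy_rates` are hypothetical; in practice the rates would come from evaluating the trained module on validation views.

```python
import math
from typing import Callable, Tuple

def tune_width(error_rates: Callable[[float], Tuple[float, float]],
               lo: float, hi: float,
               tol: float = 1e-6, max_iter: int = 100) -> float:
    """Bisect on the width until miss and false-alarm rates are equal.

    `error_rates(sigma)` returns (miss_rate, false_alarm_rate)."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        miss, false_alarm = error_rates(mid)
        if miss > false_alarm:
            lo = mid        # module too selective: widen the basis functions
        else:
            hi = mid        # module too permissive: narrow the basis functions
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def toy_rates(sigma: float) -> Tuple[float, float]:
    """Synthetic monotone error curves, for illustration only."""
    miss = math.exp(-sigma)
    return miss, 1.0 - miss

sigma_star = tune_width(toy_rates, lo=0.1, hi=5.0)
print(sigma_star)
```

For the synthetic curves the two rates are equal where $e^{-\sigma} = 1/2$, so the search converges to $\sigma = \ln 2$.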


C Vector quantization 

Vector quantization (VQ) is a technique that was originally developed for signal coding in communications and
signal processing. It is used in a variety of tasks, including speech and image compression, speech recognition and
signal processing (Gersho and Gray, 1992).

A vector quantizer $Q$ is a mapping from a $d$-dimensional Euclidean space, $S$, into a finite set $C$ of code vectors,
$Q : S \to C$, $C = (p_1, p_2, \dots, p_n)$, $p_i \in S$, $i = 1, 2, \dots, n$. Associated with every $n$-point vector quantizer is a partition
of $S$ into $n$ regions, $R_i = \{x \in S : Q(x) = p_i\}$.

Vector quantizer performance is measured by distortion $d(x, \hat{x})$, a cost associated with representing an input
vector $x$ by a quantized vector $\hat{x}$. The goal in designing an optimal vector quantization set is to minimize the
expected distortion. The most convenient and widely used measure of distortion is the squared Euclidean distance.


Figure 15: The effect of the basis function width ($\sigma$) on the performance of an RBF module. Left: RBF-module
miss rate (dashed line), false-alarm rate (dotted line) and their mean (solid line), plotted against $\sigma$. The values
of $\sigma$ shown on the abscissa range from half the minimal distance up to the maximal distance among RBF-module
“centers” (training views) in the input space.

C.1 The generalized Lloyd (K-means) algorithm

The generalized Lloyd algorithm (GLA) for vector quantizer design (Linde et al., 1980) is known also as the k-means
method (MacQueen, 1967). According to the algorithm, an optimal vector quantizer is designed via iterative
codebook modifications to satisfy two conditions: nearest neighbor (NN) and centroid condition (CC). The former
is equivalent to constructing the Voronoi cell of each code vector, whereas the latter aims to
adjust each code vector to be the center of gravity of its domination region. The means of the $k$ initial clusters are
found, and each input point is examined to see if it is closer to the mean of another cluster than it is to the mean
of its current cluster. In that case, the point is transferred and the cluster means (centers) are recalculated. This
procedure is repeated until the chosen measure of distortion is sufficiently small.
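The two conditions can be sketched in a few lines of NumPy. This is a minimal illustration of the conventional GLA with squared Euclidean distortion; the function name `generalized_lloyd`, the stopping tolerance, and the synthetic two-cluster data are assumptions made for the example.

```python
import numpy as np

def generalized_lloyd(data, codebook, max_iter=100, tol=1e-9):
    """Alternate the nearest-neighbor and centroid conditions until the
    mean squared-Euclidean distortion stops improving."""
    codebook = np.array(codebook, dtype=float)
    previous = np.inf
    for _ in range(max_iter):
        # Nearest-neighbor condition: assign each point to its closest code vector.
        sq_dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        distortion = sq_dists[np.arange(len(data)), labels].mean()
        # Centroid condition: move each code vector to the mean of its cell.
        for j in range(len(codebook)):
            members = data[labels == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
        if previous - distortion < tol:
            break
        previous = distortion
    return codebook, labels, distortion

# Two well-separated synthetic clusters, one initial code vector taken from each.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                  rng.normal(10.0, 1.0, size=(50, 2))])
codebook, labels, distortion = generalized_lloyd(data, data[[0, 50]])
print(codebook)
```

On this data the code vectors settle near the two cluster means, and each point is assigned to the cell of its own cluster.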

C.2 The Lloyd algorithm modified to perform canonical quantization 

We next present our modification of the GLA for the canonical vector quantization (CVQ) design. 

1. Initialization: Set $N = 2$, an initial codebook size. Set $E_N = \infty$. Set $C_N$ to be an initial codebook of size $N$.
The codebook is randomly chosen from the input set.

2. Find an input vector for which the compactness is optimal, and add it to $C_N$ to create a codebook $C_{N+1}$ of
size $N + 1$.

(a) Set iteration $m = 1$, $D_m = \infty$.

(b) Given the codebook $C_m$, perform the modified Lloyd Iteration on the classifier output space to generate
the improved codebook $C_{m+1}$.

(c) Compute the sum-of-squared-error $D_m$. If $\frac{D_m - D_{m+1}}{D_m} < \epsilon$ for a suitable threshold $\epsilon$, halt. The improved
codebook $C_{m+1}$ is the set of input vectors whose classifier outputs are the closest to the codevectors
constituting the improved output codebook (see below).

Otherwise, set $m \leftarrow m + 1$, go to Step (b).

3. Calculate the classifier generalization error $E_N$. If the criterion $E_N - E_{N+1} < \epsilon$ is satisfied, finish. Otherwise,
set $N \leftarrow N + 1$, go to Step (2).

The modified Lloyd Iteration:

1. Compute classifier activity over the input set; denote this set by $O$. Denote the set of classifier outputs on the
codebook the output codebook.

2. Partition the set $O$ into clusters using the Nearest Neighbor Condition, for the output-codebook vectors being
the cluster centers.

3. Using the Centroid Condition, compute the centroids for the clusters just found, to obtain a new output
codebook.
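One pass of the modified Lloyd Iteration might be sketched as follows. The classifier is represented by an arbitrary callable that maps an input vector to its output vector; the function name `modified_lloyd_iteration`, the return values, and the toy one-dimensional data with an identity "classifier" are assumptions made for illustration, not the modules used in the experiments.

```python
import numpy as np

def modified_lloyd_iteration(inputs, codebook_idx, classifier):
    """One iteration in the classifier output space.

    inputs       : sequence of input vectors
    codebook_idx : indices of the inputs that form the current codebook
    classifier   : callable mapping an input vector to an output vector
    Returns the indices of the improved codebook and the sum of squared error."""
    # Step 1: classifier activity over the input set (the set O).
    outputs = np.array([np.atleast_1d(classifier(x)) for x in inputs], dtype=float)
    output_codebook = outputs[list(codebook_idx)]
    # Step 2: nearest-neighbor partition around the output-codebook vectors.
    sq_dists = ((outputs[:, None, :] - output_codebook[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    distortion = float(sq_dists.min(axis=1).sum())
    # Step 3: centroid condition gives the new output codebook.
    centroids = output_codebook.copy()
    for j in range(len(centroids)):
        members = outputs[labels == j]
        if len(members) > 0:
            centroids[j] = members.mean(axis=0)
    # Improved codebook: inputs whose outputs are closest to the new centroids.
    to_centroids = ((outputs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    new_idx = [int(i) for i in to_centroids.argmin(axis=0)]
    return new_idx, distortion

# Toy data: two groups on a line, identity map standing in for the classifier.
toy_inputs = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
new_idx, distortion = modified_lloyd_iteration(toy_inputs, [0, 3], lambda x: x)
print(new_idx, distortion)
```

In the toy example the codebook moves from the first member of each group to its central member, since the central outputs are closest to the cluster centroids.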
Figure 16: The activity of several RBF modules obtained for 100 test views (25 views for each of four objects). The
views, which vary along the abscissa, are grouped, so that the first 25 views belong to the first object (cow, solid line), 
with the subsequent views, in groups of 25, belonging, respectively, to cat (dotted line), tuna (dashed line), and 
TRex (dash-dotted line). Note that each classifier responds strongly to views of its target object, and significantly 
less to views of other objects. 

D Additional tables 



     | cow1 | cat2 | A1   | gene | tuna | Lrov | Niss | F16  | fly  | TRex
-----+------+------+------+------+------+------+------+------+------+------
cow1 | 4.04 | 1.86 |      | 1.62 | 0.91 | 1.22 | 1.79 | 1.21 | 0.71 | 0.53
cat2 | 1.69 | 3.55 |      | 1.02 | 1.10 | 1.20 | 2.10 | 1.04 | 0.61 | 0.53
A1   | 0.08 | 0.06 | 1.63 | 0.46 | 0.03 | 0.12 | 0.06 | 0.09 | 0.19 | 0.06
gene | 0.61 | 0.43 | 0.44 | 5.24 | 0.14 | 0.11 | 0.26 | 0.48 | 0.55 | 0.25
tuna | 1.57 | 2.00 | 0.40 | 1.11 | 4.22 | 1.41 | 3.05 | 1.77 | 0.72 | 1.02
Lrov | 0.57 | 0.56 | 0.17 | 0.20 | 0.23 | 3.36 | 1.38 | 0.36 | 0.16 | 0.11
Niss | 0.67 | 0.86 | 0.06 | 0.34 | 0.82 | 0.97 | 3.24 | 0.88 | 0.21 | 0.25
F16  | 0.50 | 0.44 | 0.11 | 0.65 | 0.58 | 0.27 | 0.94 | 2.14 | 0.24 | 0.25
fly  | 1.03 | 1.08 | 0.88 | 2.30 | 0.60 | 0.70 | 0.95 | 0.84 | 3.71 | 0.99
TRex | 0.28 | 0.34 | 0.09 | 0.60 | 0.32 | 0.14 | 0.44 | 0.36 | 0.29 | 3.67

Table 4: RBF module activities (averaged over all 169 test views) evoked by the trained objects. Each row shows 
the average activation pattern induced by views of one of the objects over the ten reference-object RBF modules; 
boldface indicates the largest entry (see section 4.1). 
       | cow1 | cat2 | A1   | gene | tuna | Lrov | Niss | F16  | fly  | TRex
-------+------+------+------+------+------+------+------+------+------+------
frog   | 0.38 | 0.28 | 0.29 | 0.18 | 0.35 | 0.20 | 0.11 | 0.09 | 0.99 | 0.16
turtle | 0.53 | 0.32 | 0.38 | 0.64 | 0.39 | 0.13 | 0.09 | 0.13 | 0.93 | 0.17
shoe   | 0.51 | 0.63 | 0.06 | 0.12 | 1.09 | 0.46 | 0.54 | 0.33 | 0.59 | 0.16
pump   | 1.33 | 1.44 | 0.01 | 0.17 | 2.37 | 0.32 | 1.02 | 0.40 | 0.83 | 0.19
Beetho | 0.09 | 0.05 | 0.10 | 0.02 | 0.07 | 0.05 | 0.01 | 0.01 | 0.38 | 0.01
girl   | 2.66 | 1.78 | 0.13 | 3.27 | 2.55 | 0.20 | 0.73 | 1.07 | 2.03 | 0.86
lamp   | 0.72 | 0.48 | 0.71 | 0.70 | 0.41 | 0.36 | 0.09 | 0.09 | 1.53 | 0.09
manate | 1.49 | 0.98 | 0.09 | 0.36 | 2.47 | 0.35 | 1.45 | 0.68 | 0.84 | 0.24
dolphi | 1.14 | 0.98 | 0.04 | 0.34 | 2.20 | 0.23 | 0.68 | 0.51 | 0.72 | 0.13
Fiat   | 1.51 | 1.77 | 0.01 | 0.12 | 3.76 | 0.46 | 2.27 | 0.87 | 0.79 | 0.27
Toyota | 2.16 | 2.13 | 0.10 | 0.25 | 2.50 | 2.00 | 2.29 | 0.69 | 0.83 | 0.30
tank   | 1.85 | 1.91 | 0.09 | 0.51 | 2.50 | 1.04 | 2.36 | 1.46 | 1.08 | 0.56
Stego  | 2.04 | 2.13 | 0.06 | 0.67 | 3.61 | 0.67 | 2.45 | 1.46 | 1.58 | 0.98
camel  | 2.20 | 1.34 | 0.04 | 0.77 | 1.75 | 0.30 | 0.65 | 0.54 | 1.02 | 0.23
giraff | 1.87 | 1.93 | 0.03 | 0.54 | 3.24 | 0.19 | 1.04 | 1.21 | 1.63 | 1.72
Gchair | 1.75 | 1.69 | 0.00 | 0.09 | 3.04 | 0.29 | 1.40 | 0.76 | 0.86 | 0.19
chair  | 2.64 | 2.65 | 0.02 | 0.44 | 4.05 | 0.82 | 2.39 | 1.06 | 1.78 | 0.51
shell  | 1.89 | 1.09 | 0.25 | 1.56 | 0.95 | 0.44 | 0.40 | 0.49 | 1.66 | 0.35
bunny  | 1.07 | 1.24 | 0.23 | 0.22 | 1.10 | 1.47 | 0.53 | 0.28 | 0.95 | 0.30
lion   | 0.55 | 0.59 | 0.09 | 0.13 | 0.54 | 0.61 | 0.20 | 0.09 | 0.60 | 0.13

Table 5: RBF activities (averaged over all 169 test views) for the 20 test objects shown in Figure 9. In each row 
(corresponding to a different test object), entries within 50% of the maximum for that row are marked by boldface. 
These entries constitute a low-dimensional representation of the test object whose label appears at the head of the 
row, in terms of similarities to some of the ten reference objects. For instance, the manatee (an aquatic mammal 
known as the sea cow) turns out to be like (in decreasing order of similarity), a tuna, a cow, and, interestingly, but 
perhaps not surprisingly, a Nissan wagon. 
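The selection rule in the caption can be illustrated on the manatee row. The sketch below assumes that "within 50% of the maximum" means an activity of at least half the row maximum; the function name `salient_references` and the short labels for the ten reference objects are illustrative.

```python
# Reference-object labels and the manatee row of Table 5.
reference = ["cow1", "cat2", "A1", "gene", "tuna", "Lrov", "Niss", "F16", "fly", "TRex"]
manatee = [1.49, 0.98, 0.09, 0.36, 2.47, 0.35, 1.45, 0.68, 0.84, 0.24]

def salient_references(names, activities, fraction=0.5):
    """Return the reference objects whose activity is at least `fraction`
    of the row maximum, in decreasing order of activity."""
    threshold = fraction * max(activities)
    kept = [(a, n) for n, a in zip(names, activities) if a >= threshold]
    return [n for a, n in sorted(kept, reverse=True)]

similar = salient_references(reference, manatee)
print(similar)  # ['tuna', 'cow1', 'Niss']
```

The threshold is half of 2.47, so only the tuna, cow, and Nissan modules pass it, in the order given in the caption.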
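     | obj    | cow1 | cat2 | A1   | gene | tuna | Lrov | Niss | F16  | fly  | TRex
-----+--------+------+------+------+------+------+------+------+------+------+------
QUAD | cow2   | 0.69 | 0.30 |      |      |      | 0.01 |      |      |      |
     | ox     | 0.93 | 0.04 | 0.02 | 0.02 |      |      |      |      |      |
     | calf   | 0.86 | 0.06 |      |      | 0.06 |      | 0.01 | 0.02 |      |
     | deer   | 0.34 | 0.62 |      |      | 0.03 |      | 0.01 |      |      |
     | Babe   | 0.88 | 0.05 |      |      |      | 0.04 |      |      | 0.03 |
     | PigMa  | 0.83 | 0.12 |      |      |      |      | 0.02 |      | 0.04 |
     | dog    | 0.33 | 0.64 |      |      | 0.01 |      | 0.01 | 0.01 |      |
     | goat   | 0.20 | 0.69 | 0.04 | 0.06 |      |      |      |      | 0.02 |
     | buff   | 0.72 | 0.17 |      | 0.03 | 0.01 | 0.03 |      |      | 0.05 |
     | rhino  | 0.69 | 0.15 |      |      | 0.01 | 0.02 | 0.11 | 0.01 |      |
FIGS | pengu  | 0.30 | 0.11 |      | 0.28 |      |      | 0.01 | 0.01 | 0.29 |
     | ape    | 0.11 | 0.11 | 0.31 |      |      |      |      |      | 0.47 |
     | bear   | 0.08 | 0.07 |      | 0.75 |      |      | 0.01 |      | 0.10 |
     | cands  |      | 0.16 | 0.74 |      |      |      |      |      | 0.10 |
     | king   |      |      | 0.67 | 0.09 |      |      |      |      | 0.24 |
     | pawn   |      |      | 0.73 |      |      |      |      |      | 0.27 |
     | venus  |      |      | 0.86 | 0.01 |      |      |      |      | 0.13 |
     | lamp   | 0.04 |      | 0.64 |      |      | 0.04 |      |      | 0.28 |
     | lamp 2 | 0.03 |      | 0.70 |      |      |      |      |      | 0.27 |
     | lamp 3 |      |      | 0.70 | 0.14 |      |      |      |      | 0.17 |
FISH | whale  | 0.08 | 0.11 |      |      | 0.80 |      |      | 0.01 |      |
     | whalK  | 0.04 | 0.04 |      |      | 0.91 |      |      |      | 0.01 |
     | shark  | 0.03 | 0.07 |      |      | 0.89 |      |      |      |      | 0.01
     | Marin  |      | 0.01 |      |      | 0.98 |      | 0.01 |      |      |
     | whalH  | 0.10 | 0.20 |      |      | 0.70 |      |      |      |      |
AIR  | F15    | 0.12 | 0.08 |      |      | 0.02 |      | 0.02 | 0.72 |      | 0.03
     | F18    | 0.09 | 0.07 |      |      | 0.06 |      | 0.01 | 0.78 |      |
     | Mig27  | 0.05 | 0.37 | 0.14 |      | 0.12 |      |      | 0.31 |      |
     | shutl  | 0.24 | 0.31 |      |      | 0.30 |      |      | 0.13 |      | 0.02
     | Ta4    | 0.11 | 0.17 |      |      | 0.10 |      | 0.02 | 0.55 |      | 0.05
CARS | Isuzu  | 0.07 | 0.07 |      |      |      | 0.04 | 0.83 |      |      |
     | Mazda  | 0.04 | 0.07 |      |      |      | 0.01 | 0.88 |      |      |
     | Mrcds  | 0.04 | 0.04 |      |      |      |      | 0.92 |      |      |
     | Mitsb  | 0.04 | 0.07 |      |      |      | 0.01 | 0.89 |      |      |
     | NissQ  | 0.07 | 0.08 |      |      |      | 0.01 | 0.83 |      | 0.01 |
     | Subru  | 0.04 | 0.04 |      |      |      |      | 0.92 |      |      |
     | SuzuS  | 0.13 | 0.17 |      |      | 0.08 | 0.30 | 0.33 |      |      |
     | ToyoC  | 0.09 | 0.07 |      |      |      | 0.05 | 0.79 |      |      |
     | Beetl  | 0.03 | 0.09 |      |      |      |      | 0.87 |      | 0.01 |
     | truck  | 0.07 | 0.05 |      |      |      |      | 0.89 |      |      |
DINO | Paras  | 0.01 | 0.05 |      |      | 0.01 |      |      |      |      | 0.93
     | Veloc  |      | 0.03 |      |      | 0.24 |      |      | 0.02 |      | 0.71
     | Alios  |      | 0.21 |      |      | 0.36 |      | 0.04 | 0.02 |      | 0.36
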

Table 6: Categorization results for the 43 test objects shown in Figure 6, for the k-NN method of section 4.2.4,
with k = 3. Each row corresponds to one of the test objects; the proportion of the 169 test views of that object
attributed to each of the categories present in the training set appears in the appropriate column. Note that the 
misclassification rate depends on the definition of category labels. Here, mean misclassification rate, over all 169 
views of all objects, was 22% for the first set of category labels (i.e., the seven categories illustrated in Figure 5), 
16% for the second set of labels (according to which the fly and the FIGURES have the same label), and 14% for the 
third set of labels (where in addition the tuna and the F16 have the same category label). 
# Test Objs | # Reference Objs
            | 1     | 5     | 10    | 15    | 20
------------+-------+-------+-------+-------+-------
2           | 0     | 0     | 0     | 0     | 0
5           | 0.077 | 0.011 | 0.006 | 0.008 | 0.006
10          | 0.140 | 0.024 | 0.009 | 0.008 | 0.007
25          | 0.183 | 0.026 | 0.009 | 0.005 | 0.005
50          | 0.055 | 0.022 | 0.012 | 0.008 | 0.007

Table 7: Error rate obtained for the discrimination task vs. the number of test and reference objects (these data are 
also plotted in Figure 10). The error rate in entry (Np, Nt ) is the mean error rate obtained for the discrimination task 
using the activities of Np reference objects, and tested on 25 views of each of the Nt test objects, employing the 3-NN 
procedure of section 4.2.2. The mean is taken over 10 independent choices of Np objects out of 20 available reference 
objects, and 10 random selections of Nt objects out of a set consisting of 50 test objects (total of (5 • 10) (5 • 10) = 2500 
independent trials). 





